Preface



The 16th Workshop on Languages and Compilers for Parallel Computing was 
held in October 2003 at Texas A&M University in College Station, Texas. It 
was organized by the Parasol Lab and the Department of Computer Science at 
Texas A&M and brought together almost 100 researchers from academia and 
from corporate and government research institutions spanning three continents. 

The program of 35 papers was selected from 48 submissions. Each paper 
was reviewed by at least two program committee members and, in many cases, 
by additional reviewers. Prior to the workshop, revised versions of accepted 
papers were informally published on the workshop’s Web site and on a CD 
that was distributed at the meeting. This year, the workshop was organized 
into sessions of papers on related topics, and each session consisted of an initial 
segment of 20-minute presentations followed by an informal 20-minute panel 
and discussion between the attendees and all the session’s presenters. This new 
format both generated many interesting and lively discussions and reduced the 
overall time needed per paper. Based on feedback from the workshop, the papers 
were revised and submitted for inclusion in the formal proceedings published in 
this volume. The informal proceedings and presentations will remain available 
at the workshop Web site: parasol.tamu.edu/lcpc03 

This year’s experience was enhanced by the pleasant environment offered by 
the Texas A&M campus. Different venues were selected for each day and meals 
were served at various campus locales, ranging from a fajitas lunch in the Kyle 
Field Press Box, to a Texas barbeque dinner on the alumni center lawn. The 
banquet was held at Messina Hof, a local winery, and was preceded by a widely 
attended tour and wine tasting session. 

The success of LCPC 2003 was due to many people. We would like to thank
the Program Committee members for their timely and thorough reviews and the
LCPC Steering Committee (especially David Padua) for providing invaluable
advice and continuity for LCPC. The Parasol staff (especially Kay Jones) did an
outstanding job with the local arrangements and workshop registration. The
Parasol students (especially Silvius Rus, Tim Smith, and Nathan Thomas)
provided excellent technical services (wireless internet, presentation support,
electronic submission, Web site, proceedings) and local transportation, and
generally made everyone feel at home.

Last, but certainly not least, we are happy to thank Microsoft Research and 
Steve Waters from Microsoft University Relations for sponsoring the banquet 
and Dr. Frederica Darema’s program at the National Science Foundation for 
providing a substantial travel grant for LCPC attendees. 
Search Space Properties for Mapping
Coarse-Grain Pipelined FPGA Applications* 



Heidi Ziegler, Mary Hall, and Byoungro So 

University of Southern California / Information Sciences Institute 
4676 Admiralty Way, Suite 1001 
Marina del Rey, California, 90292 
{ziegler,mhall,bso}@isi.edu



Abstract. This paper describes an automated approach to hardware
design space exploration, through a collaboration between parallelizing
compiler technology and high-level synthesis tools. In previous work, we
described a compiler algorithm that optimizes individual loop nests,
expressed in C, to derive an efficient FPGA implementation. In this paper,
we describe a global optimization strategy that maps multiple loop nests
to a coarse-grain pipelined FPGA implementation. The global optimization
algorithm automatically transforms the computation to incorporate
explicit communication and data reorganization between pipeline stages,
and uses metrics to guide design space exploration to consider the
impact of communication and to achieve balance between producer and
consumer data rates across pipeline stages. We illustrate the components
of the algorithm with a case study, a machine vision kernel.



1 Introduction 

The extreme flexibility of Field Programmable Gate Arrays (FPGAs), coupled
with the widespread acceptance of hardware description languages such as VHDL
or Verilog, has made FPGAs the medium of choice for fast hardware prototyping
and a popular vehicle for the realization of custom computing machines that
target multi-media applications. Unfortunately, developing programs that execute
on FPGAs is extremely cumbersome, demanding that software developers also
assume the role of hardware designers.

In this paper, we describe a new strategy for automatically mapping from
high-level algorithm specifications, written in C, to efficient coarse-grain
pipelined FPGA designs. In previous work, we presented an overview of DEFACTO,
the system upon which this work is based, which combines parallelizing compiler
technology in the Stanford SUIF compiler with hardware synthesis tools [12].
In [21] we presented an algorithm for mapping a single loop nest to an FPGA
and a case study [28] describing the communication and partitioning analysis
necessary for mapping a multi-loop program to multiple FPGAs. In this paper,
we combine the optimizations applied to individual loop nests with analyses and
optimizations necessary to derive a globally optimized mapping for multiple loop
nests. This paper focuses on the mapping to a single FPGA, incorporating more
formally ideas from [28] such as the use of matching producer and consumer
rates to prune the search space.

* This work is funded by the National Science Foundation (NSF) under Grant
CCR-0209228, the Defense Advanced Research Projects Agency under contract
number F30603-98-2-0113, and the Intel Corporation.

As the logic, communication and storage are all configurable, there are many
degrees of freedom in selecting the most appropriate implementation of a
computation, which is also constrained by chip area. Further, due to the
complexity of the hardware synthesis process, the performance and area of a
particular design cannot be modelled accurately in a compiler. For this reason,
the optimization algorithm involves an iterative cycle where the compiler
generates a high-level specification, synthesis tools produce a partially
synthesized result, and estimates from this result are used to either select the
current design or guide generation of an alternative design. This process, which
is commonly referred to as design space exploration, evaluates what is
potentially an exponentially large search space of design alternatives. As in
[21], the focus of this paper is a characterization of the properties of the
search space such that exploration considers only a small fraction of the
overall design space.

To develop an efficient design space exploration algorithm for a pipelined 
application, this paper makes several contributions: 

— Describes the integration of previously published communication and 
pipelining analyses [27] with the single loop nest design space exploration 
algorithm [21]. 

— Defines and illustrates important properties of the design space for the global 
optimization problem of deriving a pipelined mapping for multiple loop nests. 

— Exploits these properties to derive an efficient global optimization algorithm 
for coarse-grained pipelined FPGA designs. 

— Presents the results of a case study of a machine vision kernel that demon- 
strate the impact of on-chip communication on improving the performance 
of FPGA designs. 

The remainder of the paper is organized as follows. In the next section we
present some background on FPGAs and behavioral synthesis. In section 3,
we provide an overview of the previously published communication analysis. In
section 4, we describe the optimization goals of our design space exploration. In
section 5, we discuss code transformations applied by our algorithm. We present
the search space properties and a design space exploration algorithm in section 6.
We map a sample application, a machine vision kernel, in section 7. Related work
is surveyed in section 8, and we conclude in section 9.

2 Background 

We now describe the FPGA features of which we take advantage, and we also
compare hardware synthesis with optimizations performed in parallelizing
compilers. Then we outline our target application domain.
#define IMAGE 16

int u[IMAGE][IMAGE], v[IMAGE][IMAGE];
int peak[IMAGE][IMAGE];
int feature_x[IMAGE][IMAGE];
int feature_y[IMAGE][IMAGE];
int th, uh1, uh2;

/* stage S1. Apply Prewitt Edge Detector */
for(x = 0; x < IMAGE-3; x++){
  for(y = 0; y < IMAGE-3; y++){
1.    uh1 = -3*u[x][y] - ...;
2.    uh2 = 3*u[x][y] + ...;
3.    peak[x][y] = (uh1 + uh2);
  }
}

/* stage S2. Find Features - threshold */
for(x = 0; x < IMAGE-3; x++){
  for(y = 0; y < IMAGE-3; y++){
4.    if(peak[x][y] > th){
5.      feature_x[x][y] = x;
6.      feature_y[x][y] = y;
      } else {
7.      feature_x[x][y] = 0;
8.      feature_y[x][y] = 0;
      }
  }
}

/* stage S3. Compute sum-square difference */
for(x = 0; x < IMAGE-2; x++){
  for(y = 0; y < IMAGE-2; y++){
9.    if(feature_x[x][y] != 0)
10.     ssd[x][y] =
          (u[x][y]-v[x][y+1])*(u[x][y]-v[x][y+1])
  }
}

/* stage S2 after scalar replacement */
for(x = 0; x < IMAGE-3; x+=2){
  for(y = 0; y < IMAGE-3; y+=2){
    if (th < peak[x][y]) {
      feature_x_1_0_0 = x;
      feature_y_1_1_0 = y;
    } else {
      feature_x_1_0_0 = 0;
      feature_y_1_1_0 = 0;
    }
    feature_y[x][y] = feature_y_1_1_0;
    feature_x[x][y] = feature_x_1_0_0;
  }
}

/* stage S1 after unroll and jam */
for(x = 0; x < (IMAGE-3)/2; x+=2){
  for(y = 0; y < (IMAGE-3)/2; y+=2){
1.    uh1 = -3*u[x][y] - ...;
2.    uh2 = 3*u[x][y] + ...;
3.    peak[x][y] = (uh1 + uh2);
4.    uh1 = -3*u[x+1][y] - ...;
5.    uh2 = 3*u[x+1][y] + ...;
6.    peak[x+1][y] = (uh1 + uh2);
7.    uh1 = -3*u[x][y+1] - ...;
8.    uh2 = 3*u[x][y+1] + ...;
9.    peak[x][y+1] = (uh1 + uh2);
10.   uh1 = -3*u[x+1][y+1] - ...;
11.   uh2 = 3*u[x+1][y+1] + ...;
12.   peak[x+1][y+1] = (uh1 + uh2);
  }
}

Fig. 1. MVIS Kernel with Scalar Replacement (S2) and Unroll and Jam (S1)



2.1 Field Programmable Gate Arrays and Behavioral Synthesis 

FPGAs are a popular vehicle for rapid prototyping. Conceptually, FPGAs are
sets of reprogrammable logic gates. Practically, for example, the Xilinx
Spartan-3 family of devices consists of 33,280 device slices [26]; two slices
form a configurable logic block. These blocks are interconnected in a
2-dimensional mesh. As with traditional architectures, bandwidth to external
memory is a key performance bottleneck in FPGAs, since it is possible to compute
orders of magnitude more data in a cycle than can be fetched from or stored to
memory. However, unlike traditional architectures, FPGAs allow the flexibility
to devote internal configurable resources either to storage or to computation.

Configuring an FPGA involves synthesizing the functionality of the slices and
chip interconnect. Using hardware description languages such as VHDL or
Verilog, designers specify desired functionality at a high level of abstraction
known as a behavioral specification, as opposed to a low level or structural
specification.

The process of taking a behavioral specification and generating a low level 
hardware specification is called behavioral synthesis. While low level optimiza- 
tions such as binding, allocation and scheduling are performed during synthesis, 
only a few high level, local optimizations, such as loop unrolling, may be per- 
formed when directed by the programmer. Subsequent synthesis phases produce 
a device configuration file. 

2.2 Target Application Domain 

Due to their customizability, FPGAs are commonly used for applications that
have significant amounts of fine-grain parallelism and possibly can benefit from
non-standard numeric formats. Specifically, multimedia applications, including
image and signal processing on 8-bit and 16-bit data, respectively, are
applications that map well to FPGAs.

Fortunately, this domain of applications maps well to the capabilities of
current parallelizing compiler analyses, which are most effective in the affine
domain, where array subscript expressions are linear functions of the loop index
variables and constants [25]. In this paper, we restrict input programs to loop
nest computations on array and scalar variables (no pointers), where all
subscript expressions are affine with a fixed stride. The loop bounds must be
constant.¹ We support loops with control flow, but to simplify control and
scheduling, the generated code always performs conditional memory accesses.

We illustrate the concepts discussed in this paper using a synthetic
benchmark, a machine vision kernel, depicted in Figure 1. For clarity, we have
omitted some initialization and termination code as well as some of the
numerical complexity of the algorithm. The code is structured as three loop
nests nested inside another control loop (not shown in the figure) that
processes a sequence of image frames. The first loop nest extracts image
features using the Prewitt edge detector. The second loop nest determines where
the peaks of the identified features reside. The last loop nest computes a
sum-square difference between two consecutive images (arrays u and v). Using the
data gathered for each image, another algorithm would estimate the position and
velocity of the vehicle.

¹ Non-constant bounds could potentially be supported by the algorithm, but the
generated code and resulting FPGA designs would be much more complex. For
example, behavioral synthesis would transform a for loop with a non-constant
bound to a while loop in the hardware implementation.

3 Communication and Pipelining Analyses

A key advantage of parallelizing compiler technology over behavioral synthesis
is the ability to perform data dependence analysis on array variables. Analyzing
communication requirements involves characterizing the relationship between
data producers and consumers. This characterization can be thought of as a
data-flow analysis problem. Our compiler uses a specific array data-flow analysis,
reaching definitions analysis [2], to characterize the relationship between array
accesses in different pipeline stages [15]. This analysis is used for the following
purposes:

— Mapping each loop nest or straight line code segment to a pipeline stage. 

— Determining which data must be communicated. 

— Determining the possible granularities at which data may be communicated. 

— Selecting the best granularity from this set. 

— Determining the corresponding communication placement points within the 
program. 

We combine reaching definitions information and array data-flow analysis for
data parallelism [3] with task parallelism and pipelining information and capture
it in an analysis abstraction called a Reaching Definition Data Access
Descriptor (RDAD). RDADs are a fundamental extension of Data Access Descriptors
(DADs) [7], which were originally proposed to detect the presence of data
dependences either for data parallelism or task parallelism. We have extended
DADs to capture reaching definitions information as well as summarize
information about the read and write accesses for array variables in the
high-level algorithm description, capturing sufficient information to
automatically generate communication when dependences exist. Such RDAD sets are
derived hierarchically by analysis at different program points, i.e., on a
statement, basic block, loop and procedure level. Since we map each nested loop
or intervening statements to a pipeline stage, we also associate RDADs with
pipeline stages.

Definition 1 A Reaching Definition Data Access Descriptor, RDAD(A), defined
as a set of 5-tuples $\langle \alpha \mid \tau \mid \delta \mid \omega \mid \gamma \rangle$,
describes the data accessed in the $m$-dimensional array $A$ at a program
point $s$, where $s$ is either a basic block, a loop or pipeline stage.
$\alpha$ is an array section describing the accessed elements of array $A$,
represented by a set of integer linear inequalities. $\tau$ is the traversal
order of $\alpha$, a vector of length $\le m$, with array dimensions from
$(1, \ldots, m)$ as elements, ordered from slowest to fastest accessed
dimension. A dimension traversed in reverse order is annotated as $\bar{i}$.
An entry may also be a set of dimensions traversed at the same rate. $\delta$
is a vector of length $m$ and contains the dominant induction variable for each
dimension. $\omega$ is a set of definition or use points for which $\alpha$
captures the access information. $\gamma$ is the set of reaching definitions.
We refer to $\mathrm{RDAD}_{r,s}(A)$ as the set of tuples corresponding to the
reads of array $A$ and $\mathrm{RDAD}_{w,s}(A)$ as the set of writes of array
$A$ at program point $s$. Since writes do not have associated reaching
definitions, for all $\mathrm{RDAD}_{w,s}(A)$, $\gamma = \emptyset$.
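As a rough illustration of Definition 1, the sketch below stores one RDAD tuple in a C struct and tests whether a write tuple and a read tuple imply communication. It is a simplification under stated assumptions, not the compiler's representation: the array section is restricted to a rectangular box instead of general integer linear inequalities, and the names `Rdad` and `rdad_needs_comm` are hypothetical.

```c
#include <stdbool.h>

#define RDAD_MAX_DIMS   4
#define RDAD_MAX_POINTS 8

/* Hypothetical, simplified RDAD tuple <alpha | tau | delta | omega | gamma>.
 * Assumption: alpha is a box, lo[k] <= d_k < hi[k], not a general
 * set of integer linear inequalities. */
typedef struct {
    int ndims;
    int lo[RDAD_MAX_DIMS];            /* alpha: inclusive lower bounds   */
    int hi[RDAD_MAX_DIMS];            /* alpha: exclusive upper bounds   */
    int order[RDAD_MAX_DIMS];         /* tau: dimensions, slowest first  */
    const char *ivar[RDAD_MAX_DIMS];  /* delta: dominant induction vars  */
    int npoints;
    int points[RDAD_MAX_POINTS];      /* omega: definition or use points */
    int nreach;
    int reach[RDAD_MAX_POINTS];       /* gamma: reaching definitions     */
} Rdad;

/* Communication is needed when a definition point of the write tuple is
 * among the reaching definitions of the read tuple. */
bool rdad_needs_comm(const Rdad *write, const Rdad *read)
{
    for (int i = 0; i < write->npoints; i++)
        for (int j = 0; j < read->nreach; j++)
            if (write->points[i] == read->reach[j])
                return true;
    return false;
}
```

For the array peak in the MVIS kernel, the write tuple of stage S1 has definition point 3 and the read tuple of stage S2 lists 3 as a reaching definition, so the check reports that communication is required.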

After calculating the set of RDADs for a program, we use the reaching
definitions information to determine between which pipeline stages communication
must occur. To generate communication between pipeline stages, we consider
each pair of write and read RDAD tuples where an array definition point in the
sending pipeline stage is among the reaching definitions in the receiving pipeline
stage. The communication requirements, i.e., placement and data, are related
to the granularity of communication. We calculate a set of valid granularities,
based on the comparison of traversal order information from the communicating
pipeline stages, and then evaluate the execution time for each granularity in the
set to find the best choice. We define another abstraction, the Communication
Edge Descriptor (CED), to describe the communication requirements on each
edge connecting two pipeline stages.

$$\mathrm{RDAD}_{w,S1}(peak) = \langle\, 0 \le d_1 < 13,\ 0 \le d_2 < 13 \mid (1,2) \mid (x,y) \mid \{3\} \mid \emptyset \,\rangle$$

$$\mathrm{RDAD}_{r,S2}(peak) = \langle\, 0 \le d_1 < 13,\ 0 \le d_2 < 13 \mid (1,2) \mid (x,y) \mid \{4\} \mid \{3\} \,\rangle$$

Fig. 2. MVIS Kernel Communication Analysis

Definition 2 A Communication Edge Descriptor (CED), $\mathrm{CED}_{S_i \to S_j}(A)$,
defined as a set of 3-tuples $\langle \alpha \mid \lambda \mid \rho \rangle$,
describes the communication that must occur between two pipeline stages $S_i$
and $S_j$. $\alpha$ is the array section, represented by a set of integer
linear inequalities, that is transmitted on a per communication instance.
$\lambda$ and $\rho$ are the communication placement points in the send and
receive pipeline stages respectively.

Figure 2 shows the calculated RDADs for pipeline stages S1 and S2, for
array peak. The RDAD reaching definitions for array peak from pipeline stage
S1 to S2 imply that communication must occur between these two stages. From
the RDAD traversal order tuples, $\tau = (1, 2)$, we see that the array is
accessed in the same order in each stage and we may choose from among all
possible granularities, e.g., whole array, row, and element. We calculate a CED
for each granularity, capturing the data to be communicated each instance and
the communication placement. We choose the best granularity, based on total
program execution time, and apply code transformations to reflect the results
of the analysis. The details of the analysis are found in [27]. Figure 3 shows the
set of CEDs representing communication between stages S1 and S2.

4 Optimization Strategy 

In this section, we set forth our strategy for solving the global optimization 
problem. We briefly describe the criteria, behavioral synthesis estimates, and 
metrics used for local optimization, as published in [21, 20] and then describe 
how we build upon these to find a global solution. A high-level design flow is 
shown in Figure 4. The shaded boxes represent a collection of transformations 
and analyses, discussed in the next section, that may be applied to the program. 
Array sections $\alpha$ of $\mathrm{CED}_{S1 \to S2}(peak)$ for each granularity:

(a) Total Array-sized: $0 \le d_1 < 13,\ 0 \le d_2 < 13$
(b) Row-sized: $d_1 = x,\ 0 \le d_2 < 13$
(c) Element-sized: $d_1 = x,\ d_2 = y$
(d) Best: $d_1 = x,\ 0 \le d_2 < 13$

Fig. 3. MVIS Kernel Communication Analysis

Fig. 4. High Level Optimization Algorithm



The design space exploration algorithm involves selecting parameters for a set 
of transformations for the loop nests in a program. By choosing specific unroll 
factors and communication granularities for each loop nest or pair of loop nests, 
we partition the chip capacity and ultimately the memory bandwidth among 
the pipeline stages. The generated VHDL is input into the behavioral synthesis 
compiler to derive performance and area estimates for each loop nest. From this 
information, we use balance and efficiency [21], along with our two optimization
criteria, to tune the transformation parameters.
The two optimization criteria for mapping a single loop nest,

1. a design’s execution time should be minimized

2. a design’s space usage, for a given performance, should be minimized

are still valid for mapping a pipelined computation to an FPGA, but the way
in which we calculate the input and evaluate these criteria has changed. The
area(d) of design d, related to criterion 2, is a summation of the individual
behavioral synthesis estimates of the FPGA area used for the data path, control
and communication for each pipeline stage in this design. The time(d) of design
d, related to criterion 1, is a summation of the behavioral synthesis estimates
for each pipeline stage of the number of cycles it takes to run to completion,
including the time used to communicate data and excluding time saved by the
overlap of communication and computation.
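The two summations can be sketched in C as follows. The struct `StageEstimate` and the functions `design_area` and `design_time` are hypothetical names introduced for illustration; in the paper the per-stage numbers come from behavioral synthesis estimates, whereas here they are plain inputs.

```c
#include <stddef.h>

/* Hypothetical per-stage estimates, assumed to be supplied by synthesis. */
typedef struct {
    long area;            /* data path + control + communication logic   */
    long cycles;          /* cycles to completion, incl. communication   */
    long overlap_cycles;  /* cycles hidden by overlapping comm. and comp. */
} StageEstimate;

/* area(d): sum of the per-stage area estimates. */
long design_area(const StageEstimate *stages, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += stages[i].area;
    return total;
}

/* time(d): sum of per-stage cycle counts, minus the overlap savings. */
long design_time(const StageEstimate *stages, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += stages[i].cycles - stages[i].overlap_cycles;
    return total;
}
```

With two stages estimated at (area 100, 1000 cycles, 50 overlapped) and (area 200, 800 cycles, 100 overlapped), the sketch gives area(d) = 300 and time(d) = 1650.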

5 Transformations 

We define a set of transformations, widely used in conventional computing, that 
permit us to adjust computational and memory parallelism in FPGA-based 
systems through a collaboration between parallelizing compiler technology and 
high-level synthesis. To meet the optimization criteria set forth in the previous 
section, we have reduced the optimization process to a tractable problem, that 
of selecting a set of parameters, for local transformations applied to a single loop 
nest or global transformations applied to the program as a whole, that lead to 
a high-performance, balanced, and efficient design. 

5.1 Transformations for Local Optimization 

Unroll and Jam. Due to the lack of dependence analysis in synthesis tools,
memory accesses and computations that are independent across multiple
iterations must be executed serially. Unroll and jam [9], where one or more
loops in the iteration space are unrolled and the inner loop bodies are fused
together, is used to expose fine-grain operator and memory parallelism by
replicating the logical operations and their corresponding operands in the loop
body. Following unroll-and-jam, the parallelism exploited by high-level
synthesis is significantly improved.
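A minimal sketch of the transformation on a toy loop nest, written in C. The kernel, the array size `N`, and the function names are assumptions chosen for illustration (they are not the MVIS code), and `N` is assumed even so that no cleanup iteration is needed.

```c
#define N 8  /* assumed even, so an unroll factor of 2 divides it exactly */

/* Original loop nest: one result per inner iteration. */
void kernel_original(int in[N][N + 1], int out[N][N])
{
    for (int x = 0; x < N; x++)
        for (int y = 0; y < N; y++)
            out[x][y] = 2 * in[x][y] + in[x][y + 1];
}

/* Outer loop unrolled by 2 and the two inner loop bodies fused (jammed).
 * The two statements are independent, so they can be scheduled together. */
void kernel_unroll_jam(int in[N][N + 1], int out[N][N])
{
    for (int x = 0; x < N; x += 2)
        for (int y = 0; y < N; y++) {
            out[x][y]     = 2 * in[x][y]     + in[x][y + 1];
            out[x + 1][y] = 2 * in[x + 1][y] + in[x + 1][y + 1];
        }
}
```

Both versions compute the same output; the second exposes two independent multiply-add operations and four independent reads per inner iteration.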



Scalar Replacement. This transformation replaces array references by accesses
to temporary scalar variables, so that high-level synthesis will exploit reuse in
registers. Our approach to scalar replacement closely matches previous work [9].
There are, however, two differences: (1) we also eliminate unnecessary memory
writes on output dependences; and, (2) we exploit reuse across all loops in the
nest, not just the innermost loop. We peel iterations of loops as necessary to
initialize registers on array boundaries. Details can be found in [12].
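The idea of register reuse with a peeled initialization can be sketched on a one-dimensional toy loop. The functions below are hypothetical examples, not output of the compiler; the input array is assumed to hold `n + 1` elements.

```c
/* Original: two array reads per iteration; a[i + 1] is read again as
 * a[i] in the next iteration. */
void pairs_original(const int *a, int *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a[i] + a[i + 1];
}

/* Scalar-replaced: the reused value lives in a scalar (a register in the
 * synthesized design), leaving one array read per iteration. */
void pairs_scalar_replaced(const int *a, int *out, int n)
{
    if (n <= 0)
        return;
    int r0 = a[0];           /* peeled read initializes the register */
    for (int i = 0; i < n; i++) {
        int r1 = a[i + 1];   /* the only array read in the loop body */
        out[i] = r0 + r1;
        r0 = r1;             /* rotate: a[i + 1] becomes next a[i]   */
    }
}
```

The number of reads drops from 2n to n + 1, which matters when memory bandwidth is the bottleneck.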
Custom Data Layout. This code transformation lays out the data in the
FPGA’s external memories so as to maximize memory parallelism. The compiler
performs a 1-to-1 mapping between array locations and virtual memories
in order to customize accesses to each array according to their access patterns.
The result of this mapping is a distribution of each array across the virtual
memories such that opportunities for parallel memory accesses are exposed to
high-level synthesis. Then the compiler binds virtual memories to physical
memories, taking into consideration accesses by other arrays in the loop nest to
avoid scheduling conflicts. Details can be found in [22].
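One simple instance of such a 1-to-1 mapping is a cyclic distribution, sketched below in C. This is an assumption chosen for illustration; the layout actually selected by the compiler depends on the access pattern and is described in [22]. The names `MemLocation` and `cyclic_layout` are hypothetical, and the index is assumed non-negative.

```c
/* Hypothetical result type: which virtual memory holds an element,
 * and at which offset within that memory. */
typedef struct {
    int memory;
    int offset;
} MemLocation;

/* Cyclic distribution of a linearized array over num_memories virtual
 * memories: consecutive elements land in different memories. */
MemLocation cyclic_layout(int index, int num_memories)
{
    MemLocation loc;
    loc.memory = index % num_memories;
    loc.offset = index / num_memories;
    return loc;
}
```

With four virtual memories, elements 8 through 11 map to memories 0 through 3 at offset 2, so a loop body unrolled by four can fetch all of them in the same cycle.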

5.2 Transformations for Global Optimization 

Communication Granularity and Placement. With multiple, pipelined
tasks (i.e., loop nests), some of the input/output data for a task may be directly
communicated on chip, rather than requiring reading and/or writing from/to
memory. Thus, some of the memory accesses assumed in the optimization of
a single loop nest may be eliminated as a result of communication analysis.

The previously described communication analysis selects the communication
granularity that maximizes the overlap of communication and computation,
while amortizing communication costs over the amount of data communicated.
This granularity may not be ideal when other issues, such as on-chip space
constraints, are taken into account. For example, if the space required for
on-chip buffering is not available, we might need to choose a finer granularity
of communication. In the worst case, we may move the communication off-chip
altogether.
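The fallback from a preferred granularity to a finer one can be sketched as follows. The enum, the function names, and the buffer model are assumptions made for illustration: the buffer is taken to hold exactly one communication instance (no double buffering), and capacity is counted in array elements.

```c
/* Hypothetical granularity levels, ordered from coarsest to finest;
 * GRAN_OFF_CHIP means communicating through external memory. */
typedef enum {
    GRAN_ARRAY = 0,
    GRAN_ROW = 1,
    GRAN_ELEMENT = 2,
    GRAN_OFF_CHIP = 3
} Granularity;

/* Elements buffered per communication instance for a rows x cols section. */
long buffer_elems(Granularity g, long rows, long cols)
{
    switch (g) {
    case GRAN_ARRAY:   return rows * cols;
    case GRAN_ROW:     return cols;
    case GRAN_ELEMENT: return 1;
    default:           return 0;  /* off-chip: no on-chip buffer */
    }
}

/* Starting at the preferred granularity, move to finer ones until the
 * buffer fits; give up and go off-chip if none fits. */
Granularity fit_granularity(Granularity preferred, long rows, long cols,
                            long capacity_elems)
{
    for (int g = preferred; g < GRAN_OFF_CHIP; g++)
        if (buffer_elems((Granularity)g, rows, cols) <= capacity_elems)
            return (Granularity)g;
    return GRAN_OFF_CHIP;
}
```

For the 13 x 13 section of peak, a buffer of 100 elements cannot hold the whole array (169 elements), so a whole-array preference falls back to row-sized communication.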

Data Reorganization On-Chip. As part of the single loop solution, we
calculated the best custom data layout for each accessed array variable, allowing
a pipeline stage to achieve its best performance. When combining stages that
access the same data either via memory or on-chip communication on the same
FPGA, the access patterns for each stage may be different and thus optimal
data layouts may be incompatible. One strategy is to reorganize the data
between loop nests to retain the locally optimal layouts. In conventional
systems, data reorganization can be very expensive in both CPU cycles and cache
or memory usage, and as a result, usually carries too much overhead to be
profitable. In FPGAs, we recognize that the cost of data reorganization is in
many cases quite low. For data communicated on-chip between pipeline stages that
is already consuming buffer space, the additional cost of data reorganization is
negligible in terms of additional storage, and because the reorganization can be
performed completely in parallel on an FPGA, the execution time overhead may be
hidden by the synchronization between pipeline stages. The implementation of
on-chip reorganization involves modifying the control in the finite state
machine for each pipeline stage, which is done automatically by behavioral
synthesis; the set of registers containing the reorganized array will simply be
accessed in a different order. The only true overhead is the increased
complexity of routing associated with the reorganization; this in turn would
lead to increased space used for routing as well as a potentially slower
achieved clock rate.
6 Search Space Properties

The optimization involves selecting unroll factors for the loops in the nest of
each pipeline stage, subject to space and performance considerations. Our search
is guided by the following observations about the impact of the unroll factor
and other optimizations for a single loop in the nest, which together define the
global design space:

Observation 1 As a result of applying communication analysis, the number
of memory accesses in a loop is non-increasing as compared to the single loop
solution without communication.

The goal of communication analysis is to identify data that may be
communicated between pipeline stages using either an on-chip or an off-chip
method. The data that may now be communicated via on-chip buffers would have
been communicated via off-chip memory prior to this analysis.

Observation 2 Starting from the design found by applying the single loop with
communication solution, the unroll factors calculated during the global
optimization phase will be non-increasing.

We start by applying the single loop optimizations along with communication 
analysis. We assume that this is the best balanced solution in terms of memory 
bandwidth and chip capacity usage. We also assume that the ratio of performance 
to area has the best efficiency rating as compared to other designs investigated 
during the single loop exploration phase. Therefore, we take this result to be 
the worst case space estimate and the best case performance achievable by this 
stage in isolation; unrolling further would not be beneficial. 

Observation 3 When the producer and consumer data rates for a given com- 
munication event are not equal, we may decrease the unroll factor of the faster 
pipeline stage to the point at which the rates are equal. We assume that reducing 
the unroll factor does not cause this pipeline stage to become the bottleneck. 

When comparing two pipeline stages between which communication occurs, 
if the rates are not matched, the implementation of the faster stage may be using 
an unnecessarily large amount of the chip capacity while not contributing to the 
overall performance of the program. This is due to the fact that performance 
is limited by the slower pipeline stage. We may choose a smaller unroll factor 
for the faster stage such that the rates match. Since the slower stage is the 
bottleneck, choosing a smaller unroll factor for the faster stage does not affect 
the overall performance of the pipeline until the point at which the faster stage 
becomes the slower stage. 

Finally, if a pipeline stage is involved in multiple communication events, we 
must take care to decrease the unroll factor based on the constraints imposed 
by all events. We do not reduce the unroll factor of a stage to the point that it 
becomes a bottleneck. 
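For a single producer and consumer pair, the rate matching of Observation 3 can be sketched in C as below. The cycle model is a hypothetical stand-in: it assumes a stage's cycle count scales as the base count divided by the unroll factor, rounded up, whereas the paper obtains these numbers from behavioral synthesis estimates. The function names are likewise assumptions.

```c
/* Hypothetical cycle model: unrolling by u divides the cycle count,
 * rounded up. */
long stage_cycles(long base_cycles, int unroll)
{
    return (base_cycles + unroll - 1) / unroll;
}

/* Return the smallest unroll factor for the faster stage that keeps it
 * at least as fast as the slower stage. If the stage is already the
 * slower one, its unroll factor is left unchanged. */
int match_unroll(long base_cycles, int unroll, long slow_cycles)
{
    if (stage_cycles(base_cycles, unroll) > slow_cycles)
        return unroll;  /* not the faster stage: nothing to reduce */
    while (unroll > 1 &&
           stage_cycles(base_cycles, unroll - 1) <= slow_cycles)
        unroll--;
    return unroll;
}
```

A stage with a base of 1600 cycles and unroll factor 8 runs in 200 cycles; if its neighbor needs 400 cycles, the factor can drop to 4 without making this stage the bottleneck. A stage involved in several communication events would have to satisfy this check against every neighbor, which the sketch does not cover.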
Fig. 5. MVIS Task Graph



6.1 Optimization Algorithm 

At a high level, the design space exploration algorithm involves selecting
parameters for a set of transformations for the loop nests in a program. By
choosing specific unroll factors and communication granularities for each loop
nest or pair of loop nests, we partition the chip capacity and ultimately the
memory bandwidth among the pipeline stages. The generated VHDL is input into the
behavioral synthesis compiler to derive performance and area estimates for each
loop nest. From this information, we can tune the transformation parameters to
obtain the best performance.

The algorithm represents a multiple loop nest computation as an acyclic task 
graph to be mapped onto a pipeline with no feedback. To simplify this discussion, 
we describe the task graph for a single procedure, although interprocedural task 
graphs are supported by our implementation. Each loop nest or computation 
between loop nests is represented as a node in the task graph. Each has a set of 
associated RDADs. Edges, each described by a CED, represent communication 
events between tasks. There is one producer and one consumer pipeline stage 
per edge. The task graph for the MVIS kernel is shown in Figure 5. Associated 
with each task is the unroll factor for the best hardware implementation, area 
and performance estimates, and balance and efficiency metrics. 

1. We apply the communication and pipelining analyses to 1) define the stages 
of the pipeline and thus the nodes of the task graph and 2) identify data 
which could be communicated from one stage to another and thus define the 
edges of the task graph. 

2. In reverse topological order, we visit the nodes in the task graph to identify
communication edges where producer and consumer rates do not match.
From Observation 3, if reducing a producer or consumer rate does not cause
a task to become a bottleneck in the pipeline, we may modify it.

3. We compute the area of the resulting design, which we currently assume is the 
sum of the areas of the single loop nest designs, including the communication 
logic and buffers. If the space utilization exceeds the device capacity, we 
employ a greedy strategy to reduce the area of the design. We select the 
largest task in terms of area, and reduce its unroll factor. 

4. Repeat steps two and three until the design meets the space constraints of 
the target device. 
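
The four steps above can be sketched as follows. This is only an illustrative sketch, not the DEFACTO implementation: the `rate` and `area` callbacks stand in for the estimates obtained from behavioral synthesis, the `Task` type and the function names are hypothetical, and halving the unroll factor is an assumed step size. Restricting the greedy choice to tasks that can still be shrunk is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Task:
    name: str
    unroll: int  # current unroll factor


def match_rates(tasks: Dict[str, Task], edges: List[Tuple[str, str]],
                rate: Callable[[str, int], float]) -> None:
    """Step 2: on each edge, shrink the faster end while it stays at least
    as fast as the slower end, so it never becomes the new bottleneck."""
    changed = True
    while changed:  # repeat until every edge is stable
        changed = False
        for prod, cons in reversed(edges):  # edges are listed in topological order
            p_rate = rate(prod, tasks[prod].unroll)
            c_rate = rate(cons, tasks[cons].unroll)
            fast, slow_rate = (prod, c_rate) if p_rate > c_rate else (cons, p_rate)
            t = tasks[fast]
            while t.unroll > 1 and rate(fast, t.unroll // 2) >= slow_rate:
                t.unroll //= 2  # halving is an assumed step size
                changed = True


def explore(tasks: Dict[str, Task], edges: List[Tuple[str, str]],
            rate: Callable[[str, int], float],
            area: Callable[[str, int], float],
            capacity: float) -> Dict[str, int]:
    """Steps 2-4: match rates, then greedily shrink the largest task
    until the summed area fits the device."""
    while True:
        match_rates(tasks, edges, rate)
        total = sum(area(t.name, t.unroll) for t in tasks.values())
        if total <= capacity:
            return {t.name: t.unroll for t in tasks.values()}
        shrinkable = [t for t in tasks.values() if t.unroll > 1]
        if not shrinkable:
            raise ValueError("design does not fit even with no unrolling")
        largest = max(shrinkable, key=lambda t: area(t.name, t.unroll))
        largest.unroll //= 2
```

With a three-stage pipeline whose initial unroll factors are 4, 4, and 2 and a toy rate model in which the rate equals the unroll factor, the rate-matching step alone reduces all three factors to 2. The area step then applies only if the summed area still exceeds the capacity.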

Our initial algorithm employs a greedy strategy to meet space constraints, 
but other heuristics may be considered in future work, such as reducing space 
of tasks not on the critical path, or using the balance and efficiency metrics to 
suggest which tasks will be less impacted by reducing unroll factors. 

7 Experiments 

We have implemented the loop unrolling, the communication analysis, scalar re- 
placement, data layout, the single loop design space exploration and the trans- 
lation from SUIF to behavioral VHDL such that these analyses and transforma- 
tions are automated. Individual analysis passes are not fully integrated, requiring 
minimal hand intervention. 

We examine how the number of memory accesses has changed when compar- 
ing the results of the automated local optimization and design space exploration 
with and without applying the communication analyses. In Table 1 we show 
the number of memory accesses in each pipeline stage before and after apply- 
ing communication analysis. The rows entitled Accesses Before and After are 
the results without and with communication analysis respectively. As a result 
of the communication analysis, the number of memory accesses greatly declines 
for all pipeline stages. In particular, for pipeline stage S2, the number of memory 
accesses goes to zero because all consumed data is communicated on-chip 
from stage S1 and all produced data is communicated on-chip to stage S3. This 
should have a large impact on the performance of the pipeline stage. For pipeline 
stages S1 and S3, the reduction in the number of memory accesses may 
be sufficient to transform the pipeline stage from a memory bound stage into 
a compute bound stage. This should also improve performance of each pipeline 
stage and ultimately the performance of the total program. 



Table 1. Memory Access Reduction 

Pipeline Stage       1     2     3 
Accesses Before     49   117    45 
Accesses After       2     0     6 




From the design space exploration for each single loop, we would choose 
unroll factors of 4, 4, and 2 for pipeline stages S1, S2, and S3. This is based on 
both the metrics and estimates as explained in [28]. 

We then apply the design space exploration with global optimizations. Since 
the sum of the areas, 306K Monet space units, for the implementation for all 
three pipeline stages with the previously mentioned unroll factors is larger than 
the total area of the chip (150K), we must identify one or more pipeline stages for 
which to decrease the unroll factors. We apply the second step of our algorithm, 
which matches producer and consumer rates throughout the pipeline. Since S3 
is the bottleneck when comparing the rates between stages S2 and S3, we know 
that we may reduce the unroll factor of stage S2 to 2 without affecting the 
pipeline performance. Then, our algorithm will detect a mismatch between stages 
S1 and S2. Again, we may decrease the unroll factor of stage S1 from 4 to 2 
without affecting performance. Then we perform the analyses once again on each 
pipeline stage, using the new unroll factor of 2 for all pipeline stages. The size 
of the resulting solution is 103K Monet units. We are now within our space 
constraint. 

In summary, by eliminating memory accesses through scalar replacement and 
communication analysis, and by then matching producer and consumer data 
rates for each pipeline stage, we were able to achieve a good mapping while 
eliminating large parts of the search space. 

8 Related Work 

In this section we discuss related work in the areas of automatic synthesis of 
hardware circuits from high-level language constructs, array data-flow analysis, 
pipelining and design space exploration using high-level loop transformations. 



Synthesizing High-Level Constructs Languages such as VHDL and Ver- 
ilog allow programmers to migrate to configurable architectures without having 
to learn a radically new programming paradigm. Efforts in the area of new 
languages include Handel-C [18]. Several researchers have developed tools that 
map computations to reconfigurable custom computing architectures [24], while 
others have developed approaches to mapping applications to their own reconfig- 
urable architectures that are not FPGAs, e.g., RaPiD [10] and PipeRench [14]. 
The two projects most closely related to ours, the Nimble compiler and work 
by Babb et al. [6], map applications in C to FPGAs, but do not perform design 
space exploration. 



Design Space Exploration In this discussion, we focus only on related work 
that has attempted to use loop transformations to explore a wide design space. 
Other work has addressed more general issues such as finding a suitable architec- 
ture (either reconfigurable or not) for a particular set of applications (e.g., [1]). 
Derrien and Rajopadhye [ 1] describe a tiling strategy for doubly nested loops. They 
model performance analytically and select a tile size that minimizes the iteration’s 
execution time. Cameron’s estimation approach builds on their own internal 
data-flow representation using curve fitting techniques [17]. Qasem et al. [19] 
study the effects of array contraction and loop fusion. 



Array Data-Flow Analysis Previous work on array data flow analysis [7, 23, 
3] focused on data dependence analysis but not at the level of precision required 
to derive communication requirements for our platform. Parallelizing compiler 
communication analysis techniques [4, 16] exploited data parallelism. 



Pipelining In [5] Arnold created a software environment to program a set 
of FPGAs connected to a workstation; Callahan and Wawrzynek [8] used a 
VLIW-like compilation scheme for the GARP project; both works exploit intra-loop 
pipelined execution techniques. Goldstein et al. [14] describe a custom 
device that implements an execution-time reconfigurable fabric. Weinhardt and 
Luk [24] describe a set of program transformations to map the pipelined execution 
of loops with loop-carried dependences onto custom machines. Du et al. [13] 
provide compiler support for exploiting coarse-grained pipelined parallelism in 
distributed systems. 



Discussion The research presented in this paper differs from the efforts mentioned 
above in several respects. First, the focus of this research is on developing 
an algorithm that can explore a large number of design points, rather than 
selecting a single implementation. Second, the proposed algorithm takes as in- 
put a sequential application description and does not require the programmer 
to control the compiler’s transformations. Third, the proposed algorithm uses 
high-level compiler analysis and estimation techniques to guide the application 
of the transformations as well as evaluate the various design points. Our algo- 
rithm supports multi-dimensional array variables absent in previous analyses 
for the mapping of loop computations to FPGAs. Fourth, instead of focusing 
on intra-loop pipelining techniques that optimize resource utilization, we fo- 
cus on increased throughput through task parallelism coupled with pipelining, 
which we believe is a natural match for image processing data intensive and 
streaming applications. Within an FPGA, assuming the parallelism is achieved 
by the synthesis tool, we have more degrees of freedom by keeping loop bodies 
separate instead of fusing them. Finally, we use a commercially available behav- 
ioral synthesis tool to complement the parallelizing compiler techniques rather 
than creating an architecture-specific synthesis flow that partially replicates the 
functionality of existing commercial tools. Behavioral synthesis allows the de- 
sign space exploration to extract more accurate performance metrics (time and 
area used) rather than relying on a compiler-derived performance model. Our 
approach greatly expands the capability of behavioral synthesis tools through 
more precise program analysis. 
9 Conclusion 

In this paper, we describe how parallelizing compiler technology can be adapted 
and integrated with hardware synthesis tools, to automatically derive, from 
sequential C programs, pipelined implementations for systems with multiple 
FPGAs and memories. We describe our implementation of these analyses in 
the DEFACTO system, and demonstrate this approach with a case study. We 
present experimental results, derived in part automatically by our system. 
We show that we are able to reduce the size of the search space by reasoning 
about the maximum unroll factors, number of memory accesses and matching 
producer and consumer rates. While we employ a greedy search algorithm here, 
we plan to investigate trade-offs between and effects of adjusting unroll factors 
for pipeline stages both on and off the critical path. Once our design is within 
the space constraints of the chip capacity, we will continue to search for the best 
allocation of memory bandwidth. 
Abstract. Convergent scheduling is a general framework for instruction 
scheduling and cluster assignment for parallel, clustered architectures. 
A convergent scheduler is composed of many independent passes, each of 
which implements a specific compiler heuristic. Each of the passes shares 
a common interface, which allows them to be run multiple times, and 
in any order. Because of this, a convergent scheduler is presented with 
a vast number of legal pass orderings. In this work, we use machine- 
learning techniques to automatically search for good orderings. We do so 
by evolving, through genetic programming, s-expressions that describe 
a particular pass sequence. Our system has the flexibility to create dy- 
namic sequences where the ordering of the passes is predicated upon 
characteristics of the program being compiled. In particular, we imple- 
mented a few tests on the present state of the code being compiled. We 
are able to find improved sequences for a range of clustered architec- 
tures. These sequences were tested with cross-validation, and generally 
outperform Desoli’s PCC and UAS. 



1 Introduction 

Instruction scheduling on modern microprocessors is an increasingly difficult 
problem. In almost all practical instances, it is NP-complete, and it often faces 
multiple contradictory constraints. For superscalars and VLIWs, the two primary 
issues are parallelism and register pressure. Traditional scheduling frameworks 
handle conflicting constraints and heuristics in an ad hoc manner. One approach 
is to direct all efforts toward the most serious problem. For example, many RISC 
schedulers focus on finding ILP and ignore register pressure altogether. Another 
approach is to attempt to address all the problems together. For example, there 
have been reasonable attempts to perform instruction scheduling and register 
allocation at the same time [1]. The third, and most common approach, is to 
address the constraints one at a time in a sequence of passes. This approach, 
however, introduces pass ordering problems, as decisions made by early passes 
are based on partial information and can adversely affect the quality of decisions 
made by subsequent passes. 

Diego Puppin et al. 
L. Rauchwerger (Ed.): LCPC 2003, LNCS 2958, pp. 17-31, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004 

Convergent Scheduling [2, 3] alleviates pass ordering problems by spreading 
scheduling decisions over the entire compilation. Each pass makes soft decisions 
about instruction placement: it asserts its preference of instruction placement, 
but does not impose a hard schedule on subsequent passes. All passes in the 
convergent scheduler share a common interface: the input and output of each one 
is a collection of spatial and temporal preferences of instructions, and a pass operates 
by modifying these data. As the scheduler applies the passes in succession, the 
preference distribution will converge to a final schedule that incorporates the 
preferences of all the constraints and heuristics. 

Passes can be run multiple times, and in any order. Thus, while mitigating 
ordering problems due to hard constraints, a convergent scheduler is presented 
with a limitless number of legal pass orders. In our previous work [3], we tediously 
hand-tuned the pass order. This paper builds upon it by using machine learning 
techniques to automatically find good orderings for a convergent scheduler. Because 
different parallel architectures have unique scheduling needs, the speedups 
our system is able to obtain by creating architecture-specific pass orderings are 
impressive. 

Equally impressive is the ease with which it finds effective sequences. Using 
a modestly sized cluster of workstations, our system is able to quickly find good 
convergent scheduling sequences. In less than two days, it discovers sequences 
that produce speedups ranging from 12% to 95% over our previous work, and 
generally outperform UAS [4] and PCC [5]. 

The remainder of the paper is organized as follows. Section 2 describes Ge- 
netic Programming, the machine-learning technique we use to explore the pass- 
order solution space. We describe our infrastructure and methodology in Sec- 
tion 3. Section 4 quickly describes the set of available heuristics. Section 5 follows 
with a description of the experimental results. Section 6 discusses related work, 
and finally, Section 7 concludes. Because of limited space, we refer you to [2, 3] 
for architecture and implementation details related to convergent scheduling. 

2 Genetic Programming 

From one generation to the next, architectures in the same processor family may 
have extremely different internal organizations. The Intel Pentium™ family of 
processors is a case in point. Even though the ISA has remained largely the 
same, the internal organization of the Pentium 4 is drastically different from 
that of the baseline Pentium. 

To help designers keep up with market pressure, it is necessary to automate 
as much of the design process as possible. In our first work with convergent 
scheduling, we tediously hand-tuned the sequence of passes. While the sequence 
works well for the processors we explored in our previous work, it does not generally 
apply to new architectural configurations. Different parallel architectures 
necessarily emphasize different grains of computation, and thus have unique 
compilation needs. 

Fig. 1. Flow of genetic programming. Genetic programming (GP) initially creates 
a population of expressions. Each expression is then assigned a fitness, which is a measure 
of how well it satisfies the end goal. In our case, fitness is proportional to the execution 
time of the compiled application(s). Until some user-defined cap on the number 
of generations is reached, the algorithm probabilistically chooses the best expressions 
for mating and continues. To guard against stagnation, some expressions undergo mutation. 

We therefore developed a tool to automatically customize our convergent 
scheduler to any given architecture. The tool generates a sequence of passes 
from those described in Section 4. This section describes genetic programming 
(GP), the machine-learning technique that our tool uses. 

Of the many available learning techniques, we chose to employ genetic programming 
because its attributes fit the needs of our application. GP [6] is one 
example of an evolutionary algorithm (EA). The thesis behind evolutionary computation 
is that a computational version of fitness-based selection, reproductive 
inheritance and blind variation acting upon a population will lead the indi- 
viduals in subsequent generations to adapt toward better performance in their 
environment. 

In the general GP framework, individuals are represented as parse trees (or 
equivalently, as lisp expressions) [6]. In our case, the parse trees represent a se- 
quence of conditionally executed passes. The result of each subexpression is either 
a convergent scheduling pass, or a sequence of passes. Our system evaluates an 
individual in a pre-order traversal of the tree. 

Table 1 shows the grammar we use to describe pass orders. The <variable> 
expression is used to extract pertinent information about the status of the sched- 
ule, and the shape of the block under analysis. This introspection allows the 
scheduler to run different passes based on schedule state. The four variables 
that our system considers are shown in Table 2. 
Table 1. Grammar for genome s-expressions. <variable> returns the value computed 
by our tests on the graph and the current schedule 

<sexpr>    ::= ( ‘sequence’ <sexpr> <sexpr> ) 
             | ( ‘if’ <variable> <sexpr> <sexpr> ) 
             | ( <pass> ) 

<variable> ::= #1 - Is imbalanced 
             | #2 - Is fat 
             | #3 - Is within CPL 
             | #4 - Is placement bad 

<pass>     ::= ‘PATH’ | ‘COMM’ | ‘NOISE’ | ‘INITTIME’ 
             | ‘SUCC’ | ‘LOAD’ | ‘EDGES’ | ‘DEP’ 
             | ‘BEST’ | ‘FUNC’ | ‘PLACE’ | ‘SEQUENTIAL’ 
             | ‘FIRST’ | ‘CLUSTER’ | ‘EMPHCP’ 

Table 2. The variables used by our system. Their values are updated during compilation. 

Variable               True if 
#1 Is imbalanced       the difference in load between the most and the least loaded 
                       cluster is larger than 1/numcluster 
#2 Is fat              the number of independent critical paths is larger than the 
                       number of tiles 
#3 Is within CPL       the number of instructions in the block is smaller than the 
                       number of tiles times the critical path length 
#4 Is placement bad    the number of unplaced instructions is more than half the 
                       number of instructions in the block 
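
To make the grammar and the four tests concrete, the following sketch evaluates a genome in pre-order. It is a minimal illustration under stated assumptions, not our scheduler's code: the `BlockState` fields, the nested-tuple encoding of s-expressions, and the names `make_tests` and `run_genome` are hypothetical, and cluster loads are assumed to be expressed as fractions of the total load.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class BlockState:
    # Hypothetical snapshot of the block and schedule; field names are assumptions.
    loads: List[float]    # fraction of the total load on each cluster
    critical_paths: int   # number of independent critical paths
    tiles: int            # number of tiles (clusters)
    instructions: int     # instructions in the block
    cpl: int              # critical path length
    unplaced: int         # instructions not yet placed


def make_tests(s: BlockState) -> Dict[int, Callable[[], bool]]:
    """The four tests of Table 2, keyed by variable number."""
    return {
        1: lambda: max(s.loads) - min(s.loads) > 1.0 / len(s.loads),  # imbalanced
        2: lambda: s.critical_paths > s.tiles,                        # fat
        3: lambda: s.instructions < s.tiles * s.cpl,                  # within CPL
        4: lambda: s.unplaced > s.instructions / 2,                   # placement bad
    }


def run_genome(genome: tuple, tests: Dict[int, Callable[[], bool]],
               apply_pass: Callable[[str], None]) -> None:
    """Pre-order evaluation. A test is queried only when its 'if' is reached,
    so it sees the schedule state left by the passes that ran before it."""
    op = genome[0]
    if op == "sequence":
        run_genome(genome[1], tests, apply_pass)
        run_genome(genome[2], tests, apply_pass)
    elif op == "if":
        branch = genome[2] if tests[genome[1]]() else genome[3]
        run_genome(branch, tests, apply_pass)
    else:
        apply_pass(op)  # leaf: a single pass such as "COMM"
```

A genome such as `("sequence", ("if", 1, ("LOAD",), ("NOISE",)), ("COMM",))` would run LOAD followed by COMM on an imbalanced block, and NOISE followed by COMM otherwise.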



Figure 1 shows the general flow of genetic programming. The algorithm starts 
by creating an initial population of random parse trees. It then compiles and runs 
each of the benchmarks in our training set for each individual in the population. 
Each individual is then assigned a fitness based on how fast each of the associ- 
ated programs in the training set execute. In our case, the fitness is simply the 
average speedup (compared to the sequence used in previous work) over all the 
benchmarks in the training set. 
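
As a small worked illustration of this fitness measure, the sketch below averages per-benchmark speedups, taking speedup as the baseline cycle count divided by the candidate's cycle count. The function name and the dictionary layout are assumptions made for the example.

```python
from typing import Dict


def fitness(baseline_cycles: Dict[str, float],
            candidate_cycles: Dict[str, float]) -> float:
    """Average speedup of a candidate pass order over the baseline sequence."""
    speedups = [baseline_cycles[b] / candidate_cycles[b] for b in baseline_cycles]
    return sum(speedups) / len(speedups)
```

For example, a candidate that halves the cycle count of one benchmark and leaves a second unchanged has speedups of 2.0 and 1.0, and therefore a fitness of 1.5.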

The fittest individuals are chosen for crossover, the GP analogue of sexual 
reproduction. Crossover begins by choosing two well-fit individuals. Our system 
then clones the selected individuals, chooses a random subexpression in each 
of them, and swaps them. The net result is two new individuals, composed of 
building blocks from two fit parents. 

To guard against stagnant populations, GP often uses mutation. Mutations 
simply replace a randomly chosen subtree with a new random expression. For 
details on the mutation operators we implemented, see [7, p. 242]. In our imple- 
mentation, the GP algorithm halts when a user-defined number of iterations has 
been reached. 
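
Crossover and mutation on parse trees can be sketched as below. This is an illustrative sketch only: genomes are encoded as nested tuples, the pass list is a subset chosen for the example, and `random_expr` with its depth limit and branching probabilities is an assumption rather than the mutation operator of [7].

```python
import random

PASSES = ["PATH", "COMM", "NOISE", "LOAD", "DEP", "FUNC", "PLACE"]  # example subset


def paths(tree, path=()):
    """Yield the index path of every subexpression in the tree."""
    yield path
    children = {"sequence": (1, 2), "if": (2, 3)}.get(tree[0], ())
    for i in children:
        yield from paths(tree[i], path + (i,))


def subtree_at(tree, path):
    for i in path:
        tree = tree[i]
    return tree


def replace_at(tree, path, new):
    """Return a copy of tree with the subexpression at path replaced by new."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace_at(tree[i], path[1:], new),) + tree[i + 1:]


def crossover(a, b, rng):
    """Swap one randomly chosen subexpression between copies of a and b."""
    pa = rng.choice(list(paths(a)))
    pb = rng.choice(list(paths(b)))
    return replace_at(a, pa, subtree_at(b, pb)), replace_at(b, pb, subtree_at(a, pa))


def random_expr(rng, depth=2):
    """Build a small random expression (assumed shape and probabilities)."""
    if depth == 0 or rng.random() < 0.4:
        return (rng.choice(PASSES),)
    left, right = random_expr(rng, depth - 1), random_expr(rng, depth - 1)
    if rng.random() < 0.5:
        return ("sequence", left, right)
    return ("if", rng.randint(1, 4), left, right)


def mutate(tree, rng):
    """Replace one randomly chosen subtree with a new random expression."""
    return replace_at(tree, rng.choice(list(paths(tree))), random_expr(rng))
```

Because tuples are immutable, the parents are never modified, which plays the role of the cloning step described above.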
We conclude this section by noting some of GP’s attractive features. First, 
it is capable of exploring high-dimensional spaces. It is also highly scalable, 
highly parallel and can run effectively on a distributed cluster of workstations. 
In addition, its solutions are human-readable, compared with other algorithms 
(e.g. neural networks) where the solution is embedded in a very complex state 
space. 

3 Infrastructure and Methodology 

This section describes our compilation framework as well as the methodology 
we used to collect results. We begin by describing the GP parameters we used 
to train the convergent scheduler, then we give an overview of our experimental 
compiler and VLIW simulator. 

3.1 GP Parameters 

We wrapped the GP framework depicted in Figure 1 around our compiler and 
simulator. For each individual in the population, our harness compiles the bench- 
marks in our training suite with the pass ordering described by its genome. All 
experiments maintain a population of 200 individuals, initially randomly chosen. 
After every generation we discard the weakest 20% of the population and 
replace them with new individuals. Of these new pass orderings, half 
are completely random, and the remainder are created via the crossover operator 
described in the last section. Five percent of the individuals created via crossover are 
subject to mutation. Finally, we run each experiment for 40 generations. 

Fitness is measured as the average speed-up (over all the benchmarks in our 
training suite) when compared against the pass ordering that we used in our 
previous work [3]. We also reward parsimony by giving preference to the shorter 
of two otherwise equivalently fit sequences. 
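
One generation under these parameters might look like the sketch below. It is a hedged illustration, not our harness: the callbacks `fitness`, `size`, `new_random`, `crossover`, and `mutate` are placeholders supplied by the caller, and drawing both parents uniformly from the survivors is a simplifying assumption, since the text only states that fitter individuals are favored.

```python
import random


def next_generation(pop, fitness, size, new_random, crossover, mutate, rng,
                    discard=0.20, mutation_rate=0.05):
    """Drop the weakest 20%, refill half with random genomes and half with
    crossover children, and mutate 5% of the crossover children."""
    # Higher fitness ranks first; the shorter genome wins ties (parsimony).
    ranked = sorted(pop, key=lambda g: (-fitness(g), size(g)))
    n_new = round(len(pop) * discard)
    survivors = ranked[:len(pop) - n_new]
    fresh = [new_random(rng) for _ in range(n_new // 2)]
    while len(fresh) < n_new:
        a, b = rng.sample(survivors, 2)  # assumption: uniform parent choice
        child = crossover(a, b, rng)
        if rng.random() < mutation_rate:
            child = mutate(child, rng)
        fresh.append(child)
    return survivors + fresh
```

Calling this function 40 times on a population of 200 reproduces the experimental setup described above, given suitable callbacks.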

3.2 Compiler Flow and Simulation Environment 

Our compilation process begins in the SUIF front-end [8]. In addition to per- 
forming alignment analysis [9], the front-end carries out traditional optimizations 
such as loop unrolling, constant propagation, copy propagation, and dead code 
elimination. 

Our Chorus VLIW back-end follows [10]. Written using MachSUIF [11], the 
back-end allows us to easily vary the number of clusters, functional units, and 
registers in the target architecture. Instruction latencies, memory access laten- 
cies, and inter-cluster communication latencies are also configurable. The con- 
vergent scheduler uses such information, combined with data from alignment 
analysis, to generate effective code. Similarly, our register allocator must know 
the number of registers in each cluster. 
The result of the compilation process is a compiled simulator that we use 
to collect performance numbers. The simulator accurately models the latency 
of each functional unit. We assume that all functional units are fully pipelined. 
Furthermore, the simulator enforces lock-step execution. Thus, if a memory in- 
struction misses in the cache, all clusters will stall. The memory system is run- 
time configurable so we can easily isolate the performance of various memory 
topologies. In total, the back-end comprises nine compiler passes and a simula- 
tion library. 

The four target architectures on which we experimented are described below. 



Baseline (4cl) The baseline architecture is a 4-cluster VLIW with rich inter- 
connectivity. In this configuration, the clusters are fully connected with a 4x4 
crossbar. Thus, the clusters can exchange up to four words every cycle. The delay 
for the communication is 1 cycle. The register file, functional units, and L1 cache 
are split among the clusters, although every memory address can be 
accessed by any cluster, with a penalty of 1 cycle for non-local addresses. The 
cache takes 6 cycles to access and the register file takes 2 cycles. In addition, 
memory writes take 1 cycle. Each cluster has 64 general-purpose registers and 
64 floating-point registers. 



Limited Bus (4cl-comm) This architecture is similar to the baseline archi- 
tecture, the only difference being inter-cluster communication capabilities. This 
architecture only routes one word of data per cycle on a shared bus, which can 
be snooped, thus creating a basic broadcasting capability. Because this model 
has limited bandwidth, the space-time scheduler must be more conservative in 
splitting computation across clusters. 



Limited Bus (2cl-comm) Another experiment uses an architecture that is 
substantially weaker than the baseline. It is the same as machine 4cl-comm, 
except it only has 2 clusters. 



Limited Registers (4cl-regs) The final machine configuration on which we 
test our system is identical to the baseline architecture, except that each clus- 
ter has half the number of registers (32 general-purpose and 32 floating-point 
registers). 



4 Available Passes 

In this section, we briefly describe the passes used in our experimental framework. 
Passes are divided into time heuristics, passes for placement and critical 
path, for communication and load balancing, and register allocation. The mis- 
cellaneous passes help the convergence by breaking symmetry and strengthening 
the current assignment. For implementation details, we refer the reader to [2, 3]. 
4.1 Time Heuristics 

Initial Time Assignment (INITTIME) initializes the weight matrix by 
squeezing to 0 all the time slots that are infeasible for a particular instruction. 
If the distance to the farthest root of the data-dependency graph is t, 
the preference for that instruction to be scheduled a cycle earlier than t is 
set to 0. The distance to the leaf is similarly used. 

Dependence Enforcement (DEP) verifies that no instruction is scheduled 
before an instruction on which it depends. This is done by reducing the 
preference for early time slots in the dependent instruction. 

Functional Units (FUNC) reduces the preference for overloaded time-slots, 
i.e. slots for which the load is higher than the number of available functional 
units. 

Emphasize Critical Path Distance (EMPHCP) tries to schedule every 
instruction at the time indicated by its level, i.e. the distance from roots 
and leaves. 



4.2 Placement and Critical Path 

Push to First Cluster (FIRST) gives instructions a slight bias to the first 
cluster, where our compiler guarantees the presence of all live registers at 
the end of each block (so less communication is needed for instructions in 
the first cluster). 

Preplacement (PLACE) increases, for preplaced instructions (see [9]), the 
preference for their home cluster. 

Preplacement Propagation (PLACEPROP) propagates the information 
about preplacement to neighbors in the data dependence graph. The prefer- 
ence for each cluster decreases with the distance (in the dependence graph) 
from the closest preplaced instruction in that cluster. 

Critical Path Strengthening (PATH) identifies one critical path in the 
schedule, and tries to keep it together in the least loaded cluster or in the 
home cluster of its preplaced instructions. 

Path Propagation (PATHPROP) identifies high-confidence instructions, 
and propagates their preferences to the neighbors in the critical path. 

Create Clusters (CLUSTER) creates small instruction clusters using the 
Partial Component Clustering [5] , and then allocates them to clusters trying 
to minimize communication. This is useful when the preplacement informa- 
tion is poor. 



4.3 Communication and Load Balancing 

Communication Minimization (COMM) tries to minimize communication 
by keeping in the same cluster instructions that are neighbors in the depen- 
dence graph. 
Parallelism for Successors (SUCC) exploits the broadcast feature of some 
of our VLIW configurations by distributing across clusters the children of 
an instruction which is already communicating data on the bus. The other 
instructions can snoop the value, so no extra communications will be needed. 

Load Balance (LOAD) reduces the preferences for the cluster that has the 
highest preferences so far. 

Level Distribute (LEVEL) tries to put in different clusters the instructions 
that are in the same level (distance from roots and leaves) if they do not 
communicate. 



4.4 Register Allocation 

Break Edges (EDGES) tries to reduce register pressure by breaking the data 
dependence edges that cross any specific time t in the schedule (if there 
are more edges than the available registers). This is done by reducing the 
preferences of the instructions in the edges to be scheduled around t. 

Reduce Parallelism (SEQUENTIAL) emphasizes the sequential order of 
instructions in the basic block. This reduces parallelism and register pressure 
due to values with long life-span. 



4.5 Miscellaneous 

Noise Introduction (NOISE) adds noise to the distribution to break sym- 
metry in subsequent choices. 

Assignment Strengthening (BEST) boosts the highest preference in the 
schedule, so far. 
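
To illustrate how passes of this kind can share one interface, the sketch below applies simplified LOAD-like, BEST-like, and NOISE-like passes to a preference array indexed by instruction, cluster, and time slot. It is only a sketch under assumptions: the nested-list layout, the scaling factors, the per-instruction reading of BEST, and the renormalization after each pass are illustrative choices, and the actual passes are described in [2, 3].

```python
import random


def normalize(w):
    """Rescale each instruction's preferences so that they sum to 1."""
    for inst in w:
        total = sum(sum(row) for row in inst)
        for row in inst:
            for t in range(len(row)):
                row[t] /= total


def preferred(inst):
    """Return the (cluster, time) slot with the highest preference."""
    slots = ((c, t) for c in range(len(inst)) for t in range(len(inst[c])))
    return max(slots, key=lambda ct: inst[ct[0]][ct[1]])


def load_pass(w, factor=0.5):
    """LOAD-like pass: damp the cluster holding the largest total preference."""
    load = [sum(sum(inst[c]) for inst in w) for c in range(len(w[0]))]
    busiest = load.index(max(load))
    for inst in w:
        inst[busiest] = [p * factor for p in inst[busiest]]
    normalize(w)


def best_pass(w, factor=2.0):
    """BEST-like pass: boost each instruction's currently preferred slot."""
    for inst in w:
        c, t = preferred(inst)
        inst[c][t] *= factor
    normalize(w)


def noise_pass(w, rng, amount=0.01):
    """NOISE-like pass: add small random values to break symmetry."""
    for inst in w:
        for row in inst:
            for t in range(len(row)):
                row[t] += rng.random() * amount
    normalize(w)
```

Each pass takes the preference array and returns it in the same form, which is what allows passes to be run repeatedly and in any order. A final schedule can be read off by calling `preferred` on every instruction.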

5 Results 

In this section, we compare the performance of convergent scheduling to two 
existing assignment/scheduling techniques for clustered VLIW architectures: 
UAS [4] and PCC [5]. We augmented each existing algorithm with preplacement 
information. For UAS, we modified the CPSC heuristic described in the original 
paper to give the highest priority to the home cluster of preplaced instructions. 
For PCC, the algorithm for estimating schedule lengths and communication costs 
properly accounts for preplacement information. It does so by modeling the extra 
costs incurred by the clustered VLIW machine for a non-local memory access. 

For simplicity, in the following, we will refer to the sequence (SEQ (PassA) 
(PassB)) simply as (PassA) (PassB), removing SEQ: when no variables are 
used, genomes reduce to a linear sequence of passes. Also, in all of our experi- 
ments, (inittime) is hardwired to be the first pass, as part of the initialization, 
and (place) is always run at the end of the sequence to guarantee semantics. 
Fig. 2. Performance comparisons between PCC, UAS, and Convergent scheduling on 
a four-cluster VLIW architecture. Speedup is relative to a single-cluster machine. 




5.1 Baseline (4cl) 

The baseline sequence was hand-tuned in our initial work with convergent 
scheduling. For the baseline architecture, our compiler used the following se- 
quence: 

(inittime) (noise) (first) (path) (comm) (place) 

(placeprop) (comm) (emphcp) (place) 

As shown in Figure 2, convergent scheduling outperforms UAS and PCC 
by 14% and 28%, respectively, on a four-clustered VLIW machine. Convergent 
scheduling is able to use preplacement information to find good natural partitions 
for our dense matrix benchmarks. 

5.2 Limited Bus (4cl-comm) 

We use this configuration to perform many experiments. We evolved a sequence 
for 100 generations, with 200 individuals, over seven representative benchmarks. 

Figure 4 plots the fitness of the best creature over time. The fitness is mea- 
sured as the average (across benchmarks) normalized completion time with 
respect to the sequence for our baseline architecture. The sequence improves 
quickly in the first 36 generations. After that, only minor and slow improve- 
ments in fitness could be observed. This is why, in our cross-validation tests (see 
Section 5.5), we limit our evolution to 40 generations. 
Fig. 3. Speedup on 4cl-comm compared with 1-cluster convergent scheduling (original 
sequence). Bars: PCC, UAS, Conv., Evolved. In the graph, conv. is the baseline sequence, 
evolved is the new sequence for this architecture. 



The evolved sequence is more conservative in communication. (dep) and 
(func) are important: (dep), as a side effect, increases the probability that 
two dependent instructions are scheduled next to each other in space and time; 
(func) reduces peaks on overloaded clusters, which could lead to high amounts 
of localized communication. Also, the (comm) pass is run twice, in order to limit 
the total communication load. 

(inittime) (func) (dep) (func) (load) (func) (dep) (func) 

(comm) (dep) (func) (comm) (place) 

The plot in Figure 3 compares the evolved sequence with the original se- 
quence and our reference schedulers. The evolved sequence performs about 10% 
better than UAS, and about 95% better than the sequence tuned for the base- 
line architecture. In this test, PCC performed extremely poorly, probably due 
to limitations in the modeling of communication done by our implementation of 
the internal simplified scheduler (see [5]). 

5.3 Limited Bus (2cl-comm) 

(inittime) (dep) (noise) (func) (noise) (noise) (comm) 

(func) (dep) (func) (place) 

Similar to the previous tests, (comm), (dep) and (func) are important in 
creating a smooth schedule. We notice the strong presence of (noise) in the 
middle of the sequence. It appears as if the pass is intended to move away from 
local minima by shaking up the schedule. 

The evolved sequence outperforms UAS (about 4% better) and PCC (about 
5% better). Here PCC does not show the same problems present with 4cl-comm 
(see Figure 5). We observe an improvement of 12% over the baseline sequence. 

Fig. 4. Completion time for the set of benchmarks for the fittest individual, during
evolution on 4cl-comm

Fig. 5. Speedup on 2cl-comm



5.4 Limited Registers (4cl-regs) 

Figure 6 shows the performance of the evolved sequence when compared with 
our baseline and our reference. We measure an improvement of 68% over the 
baseline sequence. Here again, (func) is a very important pass. UAS outruns 
convergent scheduling in this architecture by 6%, and PCC by 2%. We believe 
this is due to the need for new expressive heuristics for register allocation. Future 
work will investigate this. 

(inittime) (func) (dep) (func) (func) (func) (func) (path) 
(func) (place) 

5.5 Leave-One-Out Cross Validation 

We tested the robustness of our system by using leave-one-out cross validation
on 4cl-comm. In essence, cross validation helps us quantify how applicable the
sequences are when applied to benchmarks that were not in the training set.
The evolution was rerun excluding one of the seven benchmarks, and the result
tested again on the excluded benchmark. In Table 4, the results are shown as
speed-up compared with a one-cluster architecture. The seven cross-validation
evolutions reached results very similar to the full evolution, for the excluded
benchmarks too. In particular, the sequences evolved excluding one benchmark
still outperform, on average, the comparison compilers, UAS and PCC.

Fig. 6. Speedup on 4cl-regs.

Table 3. The sequence evolved in our cross-validation tests.

Excluded benchmark   Sequence
cholesky             (inittime) (comm) (load) (comm) (load) (func) (place)
fir                  (inittime) (func) (place)
yuv                  (inittime) (func) (place)
tomcatv              (inittime) (func) (best) (place)
mxm                  (inittime) (best) (best) (best) (func) (place) (place)
vvmul                (inittime) (func) (dep) (func) (place)
rbsorf               (inittime) (best) (func) (place)

The seven evolved sequences (in Table 3) are all similar: (func) is the most 
important pass for this architecture. 
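
The leave-one-out procedure can be summarized in a few lines. The sketch below only shows the shape of the loop; `evolve` and `evaluate` are hypothetical stand-ins for the genetic search and for the measurement on one benchmark, and the toy versions used here exist only so that the example runs.

```python
def leave_one_out(benchmarks, evolve, evaluate):
    """For each benchmark, evolve a sequence on the other benchmarks and
    evaluate that sequence on the excluded one."""
    results = {}
    for held_out in benchmarks:
        training = [b for b in benchmarks if b != held_out]
        sequence = evolve(training)                    # trained without held_out
        results[held_out] = evaluate(sequence, held_out)
    return results


BENCHMARKS = ["cholesky", "fir", "yuv", "tomcatv", "mxm", "vvmul", "rbsorf"]

# Toy stand-ins (assumptions): "evolving" just records the training set, and
# "evaluating" checks that the held-out benchmark was never seen in training.
toy_evolve = lambda training: tuple(training)
toy_evaluate = lambda sequence, held_out: held_out not in sequence

unseen = leave_one_out(BENCHMARKS, toy_evolve, toy_evaluate)
```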

5.6 Summary of Results 

We verified that convergent scheduling is well suited to a set of different ar- 
chitectures. Running on 20 dual-processor Pentium 4 machines, evolution takes 
a couple of days. 

Sequences that contain conditional expressions never appeared in the best
individuals. It turns out that running a pass is more beneficial than running
a test to condition its execution. This is largely because convergent scheduling
passes are somewhat symbiotic by design. In other words, the results show that
passes do not disrupt good schedules. So, running extra passes is usually not
detrimental to the final result.

Table 4. Results of cross validation, speed-up compared with 1-cluster architecture.
The highlighted numbers refer to the performance of the excluded benchmark, when
using the evolved sequence.

                              Excluded benchmark
benchmark  cholesky  fir   yuv   tomcatv  mxm   vvmul  rbsorf  full
cholesky   2.18      2.18  2.18  2.18     2.18  2.17   2.18    2.18
fir        1.35      1.35  1.35  1.35     1.35  1.35   1.35    1.35
yuv        1.53      1.53  1.53  1.53     1.53  1.16   1.53    1.53
tomcatv    1.60      1.35  1.35  1.45     1.47  1.55   1.44    1.37
mxm        2.03      2.04  2.04  2.04     2.12  2.33   2.04    1.96
vvmul      2.18      2.18  2.18  2.18     2.18  2.25   2.18    2.18
rbsorf     2.41      2.41  2.41  2.44     2.36  2.44   2.41    2.41
average    1.90      1.86  1.86  1.88     1.89  1.89   1.88    1.86

We verified that running a complex measurement can take as much time 
as running a simple pass. Therefore, when measuring the complexity of result- 
ing sequences, we assign equal weight to passes and tests. Our bias for shorter 
genomes (parsimony pressure) penalizes sequences with extra tests as well as 
sequences with useless passes. In the end, conditional tests were not used in the 
best sequences. Rather, all passes are unconditionally run. Nevertheless, we still 
believe in the potential of this approach, and leave further exploration to future 
work. 

6 Related Work 

Many researchers have used machine-learning techniques to solve hard compi- 
lation problems. Therefore, only the most relevant works are discussed here. 
Cooper et al. use a genetic-algorithm solution to evolve the order of passes in 
an experimental compiler [12]. Our research extends theirs in many significant 
ways. First, our learning representation allows for conditional execution of passes, 
while theirs does not. In addition, we differ in the end goal; because they were 
targeting embedded microprocessors, they based fitness on code size. While this 
is a legitimate metric, code size is not a big issue for parallel architectures, nor 
does it necessarily correlate with wall clock performance. We also simultane- 
ously train on multiple benchmarks to create general-purpose solutions. They 
use the application-specific sequences to hand-craft a general-purpose solution. 
Finally, we believe the convergent scheduling solution space is more interesting 
than that of an ordinary backend. The symmetry and unselfishness of convergent
scheduling passes imply an interesting and immense solution space.

Calder et al. used supervised learning techniques to fine-tune static branch
prediction heuristics [13]. They employ two learning techniques, neural networks
and decision trees, to search for effective static branch prediction heuristics.
While our methodology is similar, our work differs in several important
ways. Most importantly, we use unsupervised learning, while they use super- 
vised learning. Unsupervised learning is used to capture inherent organization 
in data, and thus, only input data is required for training. Supervised learning 
learns to match training inputs with known outcomes. This means that their 
learning techniques rely on knowing the optimal outcome, while ours does not. 
Our problem demands an unsupervised method since optimal compiler sequences 
are not known. 

The COGEN(t) compiler creatively uses genetic algorithms to map code to ir- 
regular DSPs [14]. This compiler, though interesting, evolves on a per-application 
basis. Nonetheless, the compile-once nature of DSP applications may warrant the 
long, iterative compilation process. 

7 Conclusion 

Time-to-market pressures make it difficult to effectively target next generation 
processors. Convergent scheduling’s simple interface alleviates such constraints 
by facilitating rapid prototyping of passes. In addition, an architecture-specific 
pass is not as susceptible to bad decisions made by previously run passes as in 
ordinary compilers. 

Because the scheduler’s framework allows passes to be run in any order, there 
are countless legal pass orders to consider. This paper showed how machine- 
learning techniques could be used to automatically search the pass-order solution 
space. Our genetic programming technique allowed us to easily re-target new 
architectures. 

In this paper, we also experimented with learning dynamic policies. Instead 
of choosing a fixed static sequence of passes, our system is capable of dynami- 
cally choosing the best passes for each scheduling unit, based on the status of 
the schedule. Although the learning algorithm did not find sequences that condi- 
tionally executed passes, we still have reasons to believe that this is a promising 
approach. Future work will explore this in greater detail. 

In closing, our technique was able to find architecture-specific pass orders 
which improved execution time by 12% to 95%. Cross validation showed that 
performance improvement is not limited to the benchmarks on which the se- 
quence was trained. 
Abstract. Program profiling can help performance prediction and compiler
optimization. This paper describes the initial work behind TFP,
a new profiling strategy that can gather and verify a range of flow-specific 
information at runtime. While TFP can collect more refined information 
than block, edge or path profiling, it is only 5.75% slower than a very 
fast runtime path-profiling technique. Statistics collected using TFP over 
the SPEC2000 benchmarks reveal possibilities for further flow-specific 
runtime optimizations. We also show how TFP can improve the overall 
performance of a real application. 

Keywords: Profiling, dynamic compilation, run-time optimization. 



1 Introduction 

Profiling a program can be used to predict the program’s performance [1], iden- 
tify heavily executed code regions [2, 3, 4], perform additional code optimiza- 
tions [5, 6], and locate data access patterns [7]. Traditionally, profiling has been 
used to gather information on one execution of the program, which is then used 
to improve its performance on subsequent runs. In the context of dynamic compi- 
lation and runtime optimizations, profiling information gathered in the same run 
itself can be used to improve the program’s performance. This creates a greater 
need for efficient profiling, since the runtime overheads might exceed any possible 
benefit achieved from its use. In addition, the information gathered by profiling 
must be relevant for runtime optimizations and should remain true while the 
optimized code is executed. In this paper we propose a new profiling framework, 
TFP (Time-Sensitive, Flow-Specific Profiling), that extracts temporal control
flow patterns from the code at runtime which are persistent in nature, i.e., hold
true for a given, selectable period of time. This information can then be used to
guide possible optimizations from a dynamic perspective. This paper makes the 
following contributions: 

1. Proposes a new profiling strategy that is both flow-specific and time- 
sensitive. 

2. Provides a comparison of the profiling overheads of TFP with the dynamic
path profiling of [8]. On the SPEC 2000 benchmarks, we show that TFP is
on average only 5.75% slower than the technique of [8] (which is well suited
for a dynamic environment), while collecting a wider range of information.

3. Provides a case study of RNAfold [9] that demonstrates that information 
gathered by TFP can be used to improve overall performance of an applica- 
tion. 

The rest of this paper is organized as follows. Section 2 describes the background 
and motivation for our work. Section 3 discusses our framework in detail and how 
it can be used to collect a range of runtime information. In Section 4 we discuss 
some implementation details and how they can be changed to meet specific 
requirements. Section 5 presents experimental results and a case study using our 
framework. We conclude in Section 6 with possible future research directions. 



2 Background and Motivation 

Profiling code to gather information about the flow of control has received con- 
siderable attention over the years. Most existing profiling techniques are meant 
for off-line program analysis. However, with the advent of dynamic compilation 
and runtime optimizations, the use of profile data generated for runtime use has 
increased [10, 11, 7, 12, 13, 14, 15]. In [16, 7], a technique called Bursty Tracing 
is introduced that facilitates the use of runtime profiling. This technique allows 
the programmer to skip between profiled and un-profiled versions of a code as 
well as control the duration spent in either version. Such a technique will al- 
low the user to control the overheads involved in running profiled code to a far 
greater extent. Some of these techniques require hardware support while others 
rely completely on software. Our work falls in the latter category. 

Some of the more popular flow profiling techniques include block profiling,
edge profiling [17], [18] and path profiling [2], [19]. These techniques differ in the
granularity of the information they collect, with path profiling ⊃ edge profiling
⊃ block profiling, i.e., all the information gathered by block profiling can be
gathered by edge profiling, while all the information gathered by edge profiling
can be collected by path profiling. However, retrieving this information comes at
a greater cost in terms of overheads, since one needs to maintain data structures
to save this information and often needs multiple passes over these data structures
to get the necessary granularity of information.

In [8], Bala developed a profiling technique well suited to finding path pro- 
files in a dynamic environment. This technique instruments each edge of a code 
segment with a 0 or 1, and represents each path as a starting block followed
by a bit sequence of 0’s and 1’s. The easy implementation and simplicity of
this technique make it an attractive choice for runtime path profiling. With
adequate support from the compiler and hardware this technique can provide 
near-zero overhead profiling and forms the basis of comparison for the work we 
develop here. 

However, several possible runtime optimizations such as dependence analysis
and loop unrolling can benefit from block and edge profiling alone, and often
do not require more refined information. Even though this information might be
retrieved from path profiles, it could require considerable additional processing
(to store the blocks and edges a path corresponds to, and then scan through
the paths again to retrieve the necessary information). A fundamental question
to be addressed by our research is whether there is additional advantage in
using more powerful profiling information at runtime. We also question whether
a detailed analysis of the program’s execution pattern is useful for online analysis.
For example, one might want to detect whether a single path is being executed
persistently (thereby making it a possible target of optimizations [15]) or observe
if certain pathological cases never occur [20]. This paper seeks to combine several
benefits of block, edge and path profiling in a single unified profiling framework,
providing easy and efficient access to a range of information at runtime.

(a)                          (b)

for (i=1 to 100)             for (i=1 to 100)
{                            {
    if (i%2 == 0)                if (i > 50)
        f();                         f();
    else                         else
        g();                         g();
}                            }

Fig. 1. Sample Code Snippets: (b) has a path with 50-PFP while (a) has no such path



3 The TFP Approach to Profiling 

TFP can not only count frequencies of flow-patterns but is also capable of cap- 
turing a variety of temporal trends in the code. These trends can then be used 
to guide runtime optimizations. To capture this idea we make use of persistence,
i.e., flow patterns and information that continuously hold true for a period of
time. We define the property of persistence as follows:

A K-Persistent Flow Property (K-PFP) of a program segment is a property
which holds true for the control flow of that segment for K consecutive
executions of the segment.

The motivation for such a technique lies in the assumption that if a PFP
holds for a period ΔT, it may continue for some additional time. Additional
optimizations could then be made assuming the trend would remain persistent.
For example, consider the two code snippets in Figure 1.

Traditional frequency-based profilers will find both the paths along f() and
g() to be equally hot [2]. The code in 1(a) is not suitable for runtime optimization,
since the path in the loop body only lasts for one iteration. On the other hand,
in 1(b) an optimization that is valid for only one path in the loop body would
remain valid longer, possibly making it worthwhile to perform the optimization.
This shows that frequency is not the only parameter for locating hot paths;
persistence can also be considered (similar distinctions about access patterns
can also be found in [21] for the purpose of code layout). When using a PFP
guided approach, the code snippet in 1(b) will qualify as a 50-PFP but the one
in 1(a) will not, allowing us to differentiate between them.
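
The difference between the two loops of Figure 1 can be checked with a small simulation. The sketch below is only an illustration: it records which branch ("f" or "g") each iteration takes and compares plain frequencies with the longest run of consecutive identical paths; the helper name `longest_run` is an assumption.

```python
from collections import Counter
from itertools import groupby


def longest_run(paths):
    """Length of the longest stretch of consecutive identical paths."""
    return max(len(list(group)) for _, group in groupby(paths))


# Path taken in each of the 100 iterations of the loops in Figure 1.
loop_a = ["f" if i % 2 == 0 else "g" for i in range(1, 101)]  # Fig. 1(a)
loop_b = ["f" if i > 50 else "g" for i in range(1, 101)]      # Fig. 1(b)

freq_a, freq_b = Counter(loop_a), Counter(loop_b)  # identical: 50 f, 50 g
run_a, run_b = longest_run(loop_a), longest_run(loop_b)
```

Both loops give the same frequency counts, but only loop (b) keeps the same path for 50 consecutive iterations.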

Even if a sequence of code does not have a persistent path, we might still
be interested in finding other PFPs. Each PFP might lead to a different kind
and granularity of optimization. Listed below are some other possible PFPs and
examples of runtime optimizations that can be based on them.

1. Persistently Taken Paths: This information can help the compiler identify 
a possibly smaller segment of code on which runtime path-specific optimiza- 
tions might be conducted. 

2. Persistently Untaken Basic Blocks: This information would allow one to 
form a smaller CFG by eliminating these blocks from the original CFG. 
As a result, one can eliminate dependences, loops, variables etc., leading to 
further optimizations. 

3. Persistently Taken Path Segments: Even if persistent paths do not exist we 
might have sub-paths that are persistent. This can help in eliminating certain 
dependences and code regions. 

4. Whether a Given Set of Edges Are Ever Taken: This information can be 
used to remove possible dependences at runtime. 

Though some of the existing profiling techniques can be modified to incorpo- 
rate persistence, they are aimed at gathering one kind of information efficiently. 
While path profiling can do a good job of example PFP 1, block profiling can 
perform 2 and edge profiling can collect 3 and 4 efficiently. Path profiling tech- 
niques like [2] and [8] can also be used to detect 2, 3 and 4, but would require 
maintaining additional data structures, storing additional data, and making multiple
passes of the profiled information. TFP provides a unified framework that
collects all the above mentioned PFPs with a small amount of instrumentation.
The following section describes TFP in detail. 

3.1 Detailed Description of TFP 

TFP profiles acyclic code regions (we later describe in Section 4 how we can
include nested loops) and is a hybrid between Bala’s method [8] of path profiling
and block profiling. Instead of assigning each edge a 0 or 1 (as in Bala’s
method), we represent each (profiled) block by a single bit position in a bit
string. Conceptually, each block represents an integer which is a unique power
of two (i.e., block_i is represented by the value $2^i$). The initial block always sets
the value of a register r to 0. At the end of each profiled block an instruction
is inserted to perform a mathematical ADD (or bitwise OR) of the block’s number to
r. The value of this register identifies the path taken, and Bookkeeping
code is inserted in the exit block of the instrumented region. For acyclic code
regions with multiple exit blocks we add the Bookkeeping code in each of the
exit blocks. The Bookkeeping code can vary with the kind of PFP(s) we wish
to track, as illustrated in the sections to come. Figure 2 gives an example of our
profiling method, showing a sample code region, the inserted instrumentation
code and the register values associated with the various paths.

Fig. 2. An example of profiling using TFP (values of r corresponding to the paths are
also given)

The basic idea behind our approach is that each path will produce a unique 
value in the register, as well as give all the information about the blocks that 
form that path. Thus we get the benefits of both block and path profiling si- 
multaneously (and some benefits of edge profiling as shown in Section 3.5). This 
idea is embodied in the following: 

Theorem 1. With the register assignments inserted as described above, each 
different value of the register r corresponds to a unique path. 

Proof. Since each basic block is represented by a bit in the register, the only
way in which the register can get a value is by traversing all the blocks that
correspond to a 1 in the bitwise representation of the value. Thus given a value
in the register, we can determine the basic blocks in the corresponding path.
To complete the proof, we have to show that no two paths can have the exact
same set of blocks in them. The proof is by contradiction. Assume that
$X = x_1, x_2, \ldots, x_N$ are the basic blocks that were traversed and
$Y = y_1, y_2, \ldots, y_N$ and $Z = z_1, z_2, \ldots, z_N$ are two different paths
using X, i.e., Y and Z are two different permutations of X. All the elements of X
must be unique, else X would have a loop, and thus an associated back edge,
contradicting our assumption. Now let k be the position where Y and Z first
differ, i.e., $y_i = z_i$ for $i = 1, \ldots, (k-1)$ and $y_k \neq z_k$. Obviously
$k < N$ or Y and Z would be identical. Now since $y_k \neq z_k$ there is some value
$z_j$, $j \in \{k+1, \ldots, N\}$, for which $z_j = y_k$ (since both Y and Z
have the same set of elements). Thus there exists an edge from $z_{j-1}$ to $y_k$ in the
path Z. Now $z_{j-1}$ has to appear in Y as well and it can only appear after or at
the k-th position. Thus in Y there is a path from $y_k$ to $z_{j-1}$ and we also know
that there is an edge from $z_{j-1}$ to $y_k$. Thus this edge is actually a back edge,
contradicting our assumption for profiling candidates. Hence Y and Z cannot be
different. □
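
Theorem 1 can be checked by brute force on a small example. The sketch below is an illustration under stated assumptions: the control-flow graph is hypothetical (chosen to be consistent with the paths ACEF and ABCEF and the 5-bit values used in Section 3.3, not copied from Figure 2), the entry block A only resets r, and the remaining blocks receive powers of two in topological order.

```python
# Hypothetical acyclic CFG: block -> successors (an assumption, see lead-in).
EDGES = {"A": ["B", "C"], "B": ["C"], "C": ["D", "E"],
         "D": ["E"], "E": ["F"], "F": []}

# Entry block A sets r = 0; every other block ORs in its own power of two.
BIT = {"B": 1, "C": 2, "D": 4, "E": 8, "F": 16}


def all_paths(node, exit_node):
    """Enumerate every path from node to exit_node in the acyclic CFG."""
    if node == exit_node:
        return [[node]]
    return [[node] + rest
            for succ in EDGES[node]
            for rest in all_paths(succ, exit_node)]


def register_value(path):
    """Value left in r after executing the instrumented blocks of a path."""
    r = 0                      # done by the entry block
    for block in path[1:]:
        r |= BIT[block]        # instruction inserted at the end of each block
    return r


paths = all_paths("A", "F")
values = {"".join(p): register_value(p) for p in paths}
```

On this graph the four paths produce four distinct register values, and each value also lists the blocks on its path.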
TFP provides two major benefits when compared to traditional path profiling
techniques. Firstly it collects a wider range of information as a by-product of 
path profiling. Other path profiling techniques would require additional data 
structures (TFP uses just a few variables), and multiple passes over these data 
structures to find this information. The second advantage that TFP provides is 
that most other path profiling techniques instrument the edges which can result 
in additional branches in the program, which can affect the overall performance. 
TFP instruments at the block level and though this requires instrumenting every 
block of the region it does not add further checks in the code. 

We now describe how TFP can be used to detect some of the PFPs mentioned
earlier in this section. We first consider the parameter Persistence Factor (K),
which represents a lower bound (threshold) on the persistence of interest. To
gather various K-PFPs the TFP instrumented code is executed for K iterations 
(this can be achieved using [16]). The values of the Bookkeeping variables at the 
end of these iterations reveal the various K-PFPs observed. 

3.2 Persistent Paths 

The following Bookkeeping is inserted once at the end of the acyclic region, to 
track persistent paths using TFP. 

bblock1 = bblock1 AND r;
bblock2 = bblock2 OR r;

Bookkeeping for Persistent Paths 

After running the TFP instrumented code for K successive executions, if bblock1
and bblock2 are equal then we know that we have a K-persistent path. This
follows from the fact that each path produces a unique value of r (from Theorem
1) and the only way bblock1 and bblock2 will be equal is if r remained unchanged
for the K iterations (bblock1 and bblock2 are initially set to -1 and 0 respectively).
If we detect a persistent path then we can expect the code to remain in the same
path for a while and make further optimizations based on this assumption.
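
The persistent-path check can be sketched as below. This is a minimal simulation, not the inserted machine code: the list `r_values` stands for the value of r at the end of each of the K executions, and the function names are assumptions.

```python
def run_bookkeeping(r_values):
    """Apply the Bookkeeping of Section 3.2 once per profiled execution."""
    bblock1, bblock2 = -1, 0   # all ones and all zeros, as in the text
    for r in r_values:
        bblock1 &= r           # bits set in every execution
        bblock2 |= r           # bits set in at least one execution
    return bblock1, bblock2


def has_persistent_path(r_values):
    """True if r was identical in all executions, i.e. a K-persistent path."""
    bblock1, bblock2 = run_bookkeeping(r_values)
    return bblock1 == bblock2


same_path = has_persistent_path([0b11010] * 5)                  # one path, K = 5
mixed_paths = has_persistent_path([0b11010, 0b11011, 0b11010])  # two paths
```

In Python, -1 behaves as an integer with all bits set, so the first AND simply copies r.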

3.3 Path Segments that Are Always Taken 

Even if we do not find persistent paths using the method given in Section 3.2, 
we might still want to find the set of path segments or sub-paths that are always 
taken. To get this information using TFP, we use the same Bookkeeping code 
as in Section 3.2 but assign the numbers ADD/ORed to r in each basic block
in a topologically sorted manner (this need not be done at runtime if one uses
a framework like [16] or if all regions of possible interest are instrumented at
compile time itself). Thus if each instrumented basic block $b_i$ ADD/ORs the
value $v_i$ to r, then $v_i < v_j$ if $b_i$ comes before $b_j$ in the topologically sorted order
of the blocks.

To gather the information about path segments that are always taken (during
the K successive executions of the TFP instrumented code), we need to scan
through bblock1 (from left to right or right to left) and join blocks that correspond
to adjacent 1’s in bblock1, unless there is some other block in between the two
blocks in bblock2 that is a 1. Thus if $x_{i_1}, x_{i_2}, \ldots, x_{i_k}$, where
$(i_1 < i_2 < \ldots < i_k)$, is a persistent path segment then the bit locations
$i_1, i_2, \ldots, i_k$ will be 1 in bblock1 and the only bits which are 1 in bblock2
between $i_1$ and $i_k$ will be those at $(i_1, i_2, \ldots, i_k)$. This follows from the
fact that since the blocks are topologically sorted, then the edge $(x_{i_1}, x_{i_2})$ is
always taken if no block between $x_{i_1}$ and $x_{i_2}$ is ever taken.

Consider the program graph shown in Figure 2. Assume that during a profiled
run only ACEF and ABCEF are taken. At the end of the profiled run bblock1
will contain 11010 and bblock2 will contain 11011. Following the technique given
above we see that bit positions 1, 2 and 4 (counting from the left) are set to 1 in
bblock1 and none of the other intermediate positions (position 3) are set to 1 in
bblock2. Thus we can conclude that the path segment connecting bit positions 1, 2
and 4 (i.e., CEF) is always taken, which is the case.
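
The scan described in this section can be sketched as follows. The code is an illustration under assumptions: bit positions are counted from the least significant bit (so position 0 is the first block in topological order), and the mapping of positions 0..4 to blocks B..F is inferred from the example values rather than taken from Figure 2.

```python
def persistent_segments(bblock1, bblock2, nbits):
    """Join consecutive always-taken blocks (1 bits in bblock1) unless a
    sometimes-taken block (1 only in bblock2) lies between them."""
    segments, current = [], []
    for pos in range(nbits):
        if (bblock1 >> pos) & 1:
            current.append(pos)            # block taken in every execution
        elif (bblock2 >> pos) & 1 and current:
            if len(current) > 1:           # an optional block breaks the run
                segments.append(current)
            current = []
    if len(current) > 1:
        segments.append(current)
    return segments


NAMES = ["B", "C", "D", "E", "F"]          # assumed bit order, LSB first
segs = persistent_segments(0b11010, 0b11011, 5)
named = ["".join(NAMES[p] for p in seg) for seg in segs]
```

For the values from the example, the only segment found is C, E, F, matching the conclusion in the text.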

3.4 Basic Blocks that Are Not Taken 

To determine the basic blocks that are not taken, we could use block profiling 
and check each counter of the basic blocks to see if they are 0. However, for 
TFP, we do not need counters to gather this information. Using our method this 
information can be easily obtained using the following code for Bookkeeping. 

bblock = bblock OR r; 

Bookkeeping for Blocks Persistently Not Taken 

On executing the instrumented code for K successive executions, the variable 
bblock has a 1 for all the blocks that get taken at any time during the execution 
of the instrumented code, and all bit positions that have 0’s correspond to basic 
blocks that are not executed even once, in those K executions. Note that TFP 
doesn’t gather the exact frequency of the blocks that are taken. 

It can be observed that the Bookkeeping for this PFP is a subset of the ones
described in Sections 3.2 and 3.3, and need not be additionally inserted in case
we are also instrumenting for persistent paths or sub-paths.
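
A small sketch of this bookkeeping follows. It simulates the OR accumulation over a list of r values and then reads off the zero bits; the block names and their bit order (B at the least significant bit) are assumptions carried over from the running example.

```python
def untaken_blocks(r_values, block_names):
    """Blocks whose bit is never set in r during the profiled executions."""
    bblock = 0
    for r in r_values:
        bblock |= r            # Bookkeeping: bblock = bblock OR r
    return [name for pos, name in enumerate(block_names)
            if not (bblock >> pos) & 1]


NAMES = ["B", "C", "D", "E", "F"]                      # assumed bit order
never_run = untaken_blocks([0b11010, 0b11011], NAMES)  # paths ACEF and ABCEF
```

With only ACEF and ABCEF executed, D is the one block reported as persistently not taken.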

3.5 Tracking If Specific Edges Are Taken 

Several useful optimizations are impossible to verify statically because of possible 
dependences along different control flow paths. Our framework provides an easy 
way of tracking whether a specific set of edges is ever executed (or persistently 
not executed). The compiler can use this information to eliminate false depen- 
dences at runtime, enabling several optimizations (such as constant propagation, 
loop unrolling, code compaction etc). To achieve this using our framework, we 
assign blocks their additive values based on a topological sort as described in 
Section 3.3. Thereafter if we want to test if an edge between blocks i and j is ever
taken, we add a test in the Bookkeeping code to check if the bit positions i and j
are ever simultaneously 1 with no other 1’s between them. This can be done
by defining two variables: r1, an integer with all bits from position i to
position j (inclusive) set to 1, and r2, an integer with only bit positions i and j
set to 1. These variables can be defined at compile time with their corresponding
values. At runtime, the following code is added to Bookkeeping:

if ((r AND r1) == r2)
    inform OPTIMIZER that edge (i, j) is taken;



Bookkeeping Needed to Track If Specific Edges Are Taken 

It is easy to see why this works. If the edge i -> j is ever taken, then the
bit positions i and j of r will be 1 (by definition of our profiling technique).
Moreover all the intermediate bit positions between i and j will be 0 (otherwise
the edge i -> j could not have been taken since the blocks are topologically
sorted). Thus when r is ANDed with r1, the only bit positions which will be 1
are i and j, making the profiled code call the optimizer. If after executing the
TFP instrumented code for K executions the OPTIMIZER is not informed (we
need not necessarily inform the OPTIMIZER but can just set a flag to true as
well) we can conclude that the monitored edge is persistently not taken.
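
The two masks and the test can be sketched as below. This is an illustration only: bit positions are counted from the least significant bit with i < j, the function names are assumptions, and the sample values of r reuse the hypothetical block numbering B=1, C=2, D=4, E=8, F=16.

```python
def edge_masks(i, j):
    """Masks for the edge between the blocks at bit positions i < j."""
    r1 = ((1 << (j - i + 1)) - 1) << i   # all bits from i to j set
    r2 = (1 << i) | (1 << j)             # only bits i and j set
    return r1, r2


def edge_taken(r, i, j):
    """Bookkeeping test: was edge (i, j) on the path that produced r?"""
    r1, r2 = edge_masks(i, j)
    return (r & r1) == r2


# Edge C -> E uses positions 1 and 3 under the assumed numbering.
direct = edge_taken(0b11010, 1, 3)   # path ACEF: C is followed directly by E
via_d = edge_taken(0b11110, 1, 3)    # path ACDEF: D lies between C and E
```

In a real instrumented region the masks would be constants fixed at compile time, so the runtime cost is one AND and one comparison.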

3.6 TFP for Normal Path Profiles 

TFP can be used to measure normal path frequencies as well. Each path in TFP 
produces a unique value in r. This value can be hashed into a counter array at 
the end of the profiled region to maintain the path frequencies. However, path 
profiling techniques like [2] will do a better job of maintaining such frequencies 
alone. The range of the path identifiers used by this technique is exactly equal 
to the total number of paths, making direct indexing into the counter array pos- 
sible. Both TFP and Bala’s method [8] use path identifiers that do not reflect 
the actual number of paths in the instrumented region, thereby requiring hash- 
ing. To summarize, several dynamic optimizations might not need the “exact” 
frequency of paths. However, if needed, TFP can easily be modified to maintain 
these frequencies without adding to the overheads significantly ([8] required
~3 cycles for their hashing phase).
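
Counting path frequencies from r amounts to using r as a hash key. The sketch below uses a Python dictionary as the hashed counter table; this is a stand-in chosen for brevity, not the counter array or hash function of [8], and the sample values are hypothetical.

```python
def path_frequencies(r_values):
    """Count how often each path identifier r was observed."""
    counters = {}                                 # hashed counter table
    for r in r_values:
        counters[r] = counters.get(r, 0) + 1      # done in the exit block
    return counters


freq = path_frequencies([0b11010, 0b11011, 0b11010, 0b11010])
```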

These are just some of the statistics we can gather using our profiling strategy. 
One can easily change the Bookkeeping segment to calculate further statistics 
like basic blocks that are always taken, minimum amount of persistence between 
paths etc. Moreover we have already seen that some parts of the bookkeeping
needed for different statistics overlap, making the bookkeeping more efficient.

4 Implementation Issues for TFP 

In this section we discuss some of the issues involved in implementing TFP and 
how the strategy can be modified in different situations. 
Fig. 3. Cumulative distribution of the number of basic blocks present in the profiled
code regions in (a) INT Benchmarks and (b) FP Benchmarks (x-axis: number of blocks)



4.1 Use of Variables and Registers 

Much of TFP’s value relies on the fact that it uses only a few variables to achieve 
profiling as well as to maintain the information gathered. Traditional profilers 
can consume large amounts of memory to store profiled data, thereby affecting 
the runtime performance. TFP shows that it is possible to maintain a fairly wide 
and relevant range of runtime information by using only a few variables. This,
however, is based on the assumption that the number of blocks in the instrumented
region is not too large. If the number of blocks in a region instrumented
by TFP is small, a single register can be used to represent all the
blocks. This helps in reducing the overheads of TFP, as we avoid reloading values
from memory every time profiling occurs and all the data needed for profiling
can be maintained in a single register. The bookkeeping may need a few extra
variables (depending on the amount of information we want to gather) but still 
this would be significantly less than using large arrays to store the frequencies 
of every path/block/edge. 

To test our assumption that a single register is sufficient to store temporal 
program behavior, we profiled the code regions covered by the most frequently 
executed back edges in the SPEC 2000 benchmarks 1 to see how many basic 
blocks they cover. The results are shown in Figure 3. Observe that more than 
99% of these frequently executed code regions have fewer than 64 blocks in them.
This implies that in nearly all cases a 64-bit register is sufficient to implement
TFP efficiently. 

To use TFP for code regions having more than 64 blocks, we can use a new
variable every time we finish instrumenting 64 blocks (assuming we are using
a 64-bit register), i.e. instead of just using $r$ we use $(r_1, r_2, \ldots, r_n)$ as needed. At
the end, in the Bookkeeping section, instead of checking if ($r = prev$) we check
if ($r_1 = prev_1$ AND $r_2 = prev_2$ ... AND $r_n = prev_n$) and set all $r_i$s to 0 after
that. Thus we make up for not being able to store the bit stream corresponding

1 We did not consider some trivial two block loops having just a single path. Also for 
eon and some FP benchmarks we considered less than 10 back edges as there was 
a significant drop in the frequencies of the remaining ones.

to a path in a single variable by maintaining parts of the bit stream in separate
variables. However, $n = \lceil (\text{Num of Blocks Instrumented}) / 64 \rceil$ is rarely more
than 1 (of the 110 instrumented regions only one had more than 64 blocks in it).
Thus at most we will need a few extra variables for these codes.
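
The multi-variable scheme described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the instrumentation used in the experiments: the names tfp_state, tfp_mark and tfp_bookkeep, the fixed word count TFP_WORDS, and the streak counter are all hypothetical and introduced only for this example.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical: n = ceil(blocks / 64); two words cover up to 128 blocks. */
#define TFP_WORDS 2

typedef struct {
    uint64_t r[TFP_WORDS];     /* bits of the blocks seen in the current run */
    uint64_t prev[TFP_WORDS];  /* signature of the previous run */
    unsigned streak;           /* consecutive runs with an identical signature */
} tfp_state;

/* Instrumentation placed in basic block `block_id`: a single bitwise OR. */
static void tfp_mark(tfp_state *s, unsigned block_id) {
    s->r[block_id / 64] |= (uint64_t)1 << (block_id % 64);
}

/* Bookkeeping at the end of one run of the region: compare every word
   with its previous value, then reset all r_i to 0.
   Returns the length of the current streak. */
static unsigned tfp_bookkeep(tfp_state *s) {
    int same = 1;
    for (int i = 0; i < TFP_WORDS; i++)
        same &= (s->r[i] == s->prev[i]);
    s->streak = same ? s->streak + 1 : 1;
    memcpy(s->prev, s->r, sizeof s->r);
    memset(s->r, 0, sizeof s->r);
    return s->streak;
}
```

A caller would compare the returned streak against the chosen persistence K; with TFP_WORDS set to 1 the sketch reduces to the single-register case.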

4.2 Nested Loops and Procedure Boundaries 

So far we have discussed how TFP can be applied to an acyclic region of
code. TFP can also be applied to multiply-nested regions of code. A simple way
to achieve this is to assign a separate variable to monitor each level
of loop nesting. Since loops normally don’t have more than 2-3 levels of nesting, this
should not be a problem 2 .

Another trend of interest is to track PFPs across multiple procedure calls.
For example one might detect a K-PFP in a procedure, even if the K runs of 
the instrumented code region are spread across multiple calls to the procedure. 
A simple way of achieving this is to declare the TFP variables used for profiling 
the procedure as static so that they are persistent across multiple procedure 
calls. 
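
A minimal C sketch of this idea is given below, assuming a procedure with one two-way branch. The function name process, the threshold TFP_K and the bit assignment of blocks are hypothetical; only the use of static storage for the profiling variables is taken from the text.

```c
#include <stdint.h>

#define TFP_K 3  /* hypothetical persistence threshold */

/* Returns 1 once the same path has been taken in at least TFP_K
   consecutive runs, where the runs may belong to separate calls. */
int process(int x) {
    static uint64_t prev = 0;    /* path signature of the previous call */
    static unsigned streak = 0;  /* survives between calls */
    uint64_t r = 0;

    r |= (uint64_t)1 << 0;               /* entry block */
    if (x > 0) r |= (uint64_t)1 << 1;    /* "then" block */
    else       r |= (uint64_t)1 << 2;    /* "else" block */

    /* Bookkeeping: extend or restart the streak. */
    streak = (r == prev) ? streak + 1 : 1;
    prev = r;
    return streak >= TFP_K;
}
```

Because prev and streak are static, three successive calls that follow the same branch are reported as persistent even though each call executes the region only once.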

5 Experimental Results 

5.1 Overheads of Using TFP 

We implemented TFP and ran it on 7 SPEC 2000 INT benchmarks and 6 FP 
benchmarks (the remaining benchmarks are omitted from our study since most 
of their dominant back-edges led to trivial single-path regions). Instrumentation 
was done using ATOM [22]. We instrumented the programs to detect the most 
frequently executed back edges, and then instrumented the code regions covered 
by these back edges. We omitted trivial two-block loops with a single path be- 
tween them 3 . Since ATOM itself added large overheads we decided to test the 
overheads of TFP by comparing it with our implementation of [8]. TFP did not 
maintain the path frequencies since the primary purpose of our experiments was 
to study the use of TFP in gathering PFPs. To be fair we did not save the results 
of [8] as originally done (thus preventing it from making unnecessary stores) but
just used it to ensure that the same path was persistently taken. TFP, on the
other hand, not only tracked persistent paths but also tracked persistent sub-paths
and untaken blocks (Sections 3.3 and 3.4). For our experiments, we wanted
to ensure that the instrumented code kept running for the entire duration of 
the program (to study its overall overheads) and therefore set a very high value 
of K . The normalized results are shown in Figure 4. 

2 The same technique can also be used to perform inter-procedural profiling using 
TFP by treating function calls as inner-loops and using separate variables to profile 
them. 

3 There remained 4 regions (out of the total 110 regions we instrumented) with only 
one static control flow path between them. The compiler should have coalesced them 
into a single block but did not do so.

Fig. 4. TFP vs Bala’s technique on the SPEC (a) INT Benchmarks (b) FP Benchmarks



On average, TFP was only 5.75% slower than Bala’s method, even though it
gathered a wider range of information (persistent sub-paths and untaken blocks).
For three of the FP benchmarks, TFP outperformed Bala’s method. This hap- 
pens because Bala’s method needs two instrumentation statements (a bitwise 
OR and a register shift) at each conditional edge 4 , while TFP requires a single 
instrumentation statement (a bitwise OR) at every block. For the FP bench- 
marks the paths were small and the number of blocks in a path was comparable 
to the number of conditional edges along the path, making TFP more efficient. 
For the INT benchmarks we observed that several blocks that could be coalesced 
together were left separate. Since we did not have control over the compiler, we 
instrumented each of these blocks, though ideally they would have been one 
block (reducing our overhead). Since there were no conditional edges in these 
blocks, Bala’s method did not instrument them. We believe our 5.75% relative
slowdown is a good result, since Bala’s technique achieves nearly zero-overhead
profiling with adequate compiler support. We thus conclude that TFP is
lightweight enough for runtime use on these benchmarks. 

5.2 Statistics from TFP 

In this section we present some runtime statistics collected by TFP on the SPEC 
2000 benchmarks. These statistics reveal the presence of persistent trends in 
programs which can be used for dynamic compilation. 



Persistent Paths We ran TFP over the SPEC 2000 INT and FP benchmarks 
and detected persistent paths with different values of K. The results from these 
experiments are shown in Table 1. We used static variables to track the paths as 

4 Often one needs additional conditional statements to instrument conditional edges.
TFP instruments at the block level and does not add additional conditional statements.

mentioned in Section 4.2. Since we have considered the most frequently executed
back edges, the instrumented code regions constitute a large fraction of the
program’s actual running time. In summary, the regions of code we instrumented
had 1961 static paths each on average. Of these, a small number of paths (≈ 16
for K=50 and ≈ 14 for K=100) account for a fairly large percentage (≈ 61% for
K=50 and ≈ 59% for K=100) of the total iterations in these regions
at runtime. These paths also have the property that the code continuously
stays in these paths for at least 50/100 iterations on average without shifting
to the other possible paths in the region. Thus it makes sense to perform path-specific
runtime optimizations on these paths since (i) these paths constitute
a fair fraction of the executed code and (ii) the path-specific optimizations will
hold true for a while.



Persistently Untaken Blocks Information that might also be of use is the 
number of blocks that do not get executed persistently. One can remove these 
blocks from the code iterations, which might lead to several subsequent opti- 
mizations. We used TFP to detect opportunities for such optimizations. The 

5 Note that 50-PFP ⊇ 100-PFP, and 50-PFP ≈ 100-PFP implies that most of the
PFPs with persistence 50 also had a persistence of 100.



Table 1. Runtime Persistent Paths detected by TFP in SPEC 2000 INT and FP 
benchmarks. Path denotes the number of unique persistent paths that TFP detected 
and the percentages denote what fraction of total paths executed at runtime in the 
instrumented regions were K-persistent 



Benchmark     Number of        Persistent Path Statistics
Name          Static Paths        K = 50            K = 100
              Instrumented    paths      %      paths      %

cc1                  33          11    98.14       10    96.67
gzip               3563          15    0.912       10    0.658
bzip2              1057          11    49.11       11    45.71
mcf                 130          18    55.81       18    51.91
crafty              826          62    15.54       40    12.13
eon                  20           4    3.109        3    3.108
parser            19623          40    45.90       35    41.18
AVG (INT)          3607          23    38.36    18.14    35.91
swim                  9           4    99.99        4    99.99
applu                48           6    85.71        6    85.71
apsi                 27           9    99.99        9    99.99
wupwise              18           4    61.54        4    61.54
mgrid                46          10    99.84       10    99.65
fma3d                94          16    75.05       16    75.05
AVG (FP)          40.33        8.16    87.02     8.16    86.98
AVG (net)          1961       16.15    60.81    13.53    59.48



Table 2. Runtime Persistently Untaken Blocks detected by TFP in SPEC 2000 INT
and FP benchmarks

Benchmark     Average Number of      % of Blocks NOT taken Persistently
Name          Blocks/instrumented        K = 500        K = 1000
              region

cc1                   6.6                 38.97           31.80
gzip                 16.4                 22.94           21.73
bzip2                13.6                 65.10           64.15
mcf                  10.5                 44.94           44.06
crafty               24.2                 24.12           21.59
eon                  10.0                 0.017           0.017
parser               14.6                 19.38           17.93
AVG (INT)           14.27                 30.78           28.76
swim                    4                 0.000           0.000
applu                   5                 52.47           52.47
apsi                  4.4                 15.72           15.72
wupwise               7.4                 39.18           39.18
mgrid                 5.5                 0.391           0.300
fma3d                12.5                 54.22           54.18
AVG (FP)             6.47                 27.00           26.98
AVG (net)           10.67                 29.03           27.94



total number of such untaken blocks is provided in Table 2. We have also pro- 
vided the average number of blocks in the code regions we instrumented to give 
an estimate of how many blocks one might actually eliminate temporarily. Since 
block-reduction is a smaller sub-set of path-reduction, we set higher values of K
for these experiments (500 and 1000). To summarize these results: the instrumented
code regions had on average 10.67 blocks each, of which 29.03% were
not executed for at least 500 consecutive runs of these regions and 27.94% of the 
blocks were not executed for at least 1000 consecutive runs of these regions. 



5.3 A Case Study: RNAFold 

We studied if TFP could lead to improved program performance on RNAfold [9]. 
This computational biology application folds a given RNA sequence and returns 
its minimum free energy. The major part of the program is spent in a loop of 
the form: 



    for (decomp = INFINITY, k = start_value; k < end_value; k++)
        if (decomp > (Array1[k] + Array2[k+1][j]))
            decomp = Array1[k] + Array2[k+1][j];



Though this is a predominantly memory-intensive loop, one can get some 
benefits by unrolling the loop. However, there is a true dependence on decomp
between successive iterations of the loop. If we can implement aggressive unrolling
and decomp seldom changes, then we can get a fair amount of additional
parallelism. However, we observed that if decomp changed frequently, unrolling 
slowed down the execution by consuming additional resources (registers etc.). For 
RNAfold it is not possible to decide at compile time whether unrolling might be 
useful, since the decision is dependent on the data values of the input arrays. 
One can use TFP to detect PFPs in the loop (either a persistent path along
the dependence-free path, or to see if the edge leading to the dependence is ever 
taken). If we notice that the path along which decomp doesn’t change is executed 
persistently we can decide to unroll the loop. 

Ideally the instrumentation and optimization would be done in the compiler. 
However, since we used an existing compiler (gcc-2.96) that we did not have 
full control over, we hand-coded the optimization. We manually implemented 
different unrolled versions of the loop (3-level and 4-level). The original loop was 
instrumented using TFP. The instrumentation searched for certain degrees of 
persistence along the path where decomp did not change and on finding such 
a trend it passed on control to the corresponding optimized, unrolled version. 
To test the usefulness of TFP in this experiment we also ran a separate version 
of the code with just the unrolled version of the loop. We ran the program with 
four different sizes of input sequences. The results are shown in Figure 5. 

The TFP-enabled unrolled version outperforms both the original code and
the unrolled version (without TFP). This is because the unrolled version consumes
additional registers and is only useful if it manages to introduce additional
parallelism. The TFP-enabled version uses the original loop until it finds a
persistent trend, and then dynamically transfers control to the unrolled version,
making the optimization more profitable. Though the improvements are small, the
experiments show that time-sensitive flow information can be used to improve
overall performance at runtime.
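
The control transfer can be illustrated with the following C sketch. It is a simplified stand-in rather than the hand-coded RNAfold optimization: the function min_decomp, the threshold PERSIST_K, the flat cost array (standing in for Array1[k] + Array2[k+1][j]), the use of INT_MAX in place of INFINITY, and the particular 4-way unrolling are all assumptions made for this example.

```c
#include <limits.h>

#define PERSIST_K 8  /* hypothetical degree of persistence required */

/* cost[k] stands in for Array1[k] + Array2[k+1][j]. */
int min_decomp(const int *cost, int start, int end) {
    int decomp = INT_MAX;
    int streak = 0;  /* consecutive iterations in which decomp did not change */
    int k = start;

    /* Instrumented original loop: runs until the dependence-free path
       has been taken PERSIST_K times in a row. */
    for (; k < end && streak < PERSIST_K; k++) {
        if (decomp > cost[k]) { decomp = cost[k]; streak = 0; }
        else                  { streak++; }
    }

    /* Unrolled version: four independent comparisons per iteration,
       with a single update of decomp. */
    for (; k + 3 < end; k += 4) {
        int a = cost[k]     < cost[k + 1] ? cost[k]     : cost[k + 1];
        int b = cost[k + 2] < cost[k + 3] ? cost[k + 2] : cost[k + 3];
        int m = a < b ? a : b;
        if (decomp > m) decomp = m;
    }

    /* Remainder iterations. */
    for (; k < end; k++)
        if (decomp > cost[k]) decomp = cost[k];
    return decomp;
}
```

The result is the same whichever loop handles an iteration; only the point at which control moves to the unrolled loop depends on the observed persistence.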




Fig. 5. The normalized execution times for the three optimized versions of RNAfold (x-axis: input size)

6 Conclusion

In this paper we presented a new profiling strategy, TFP, designed to be used in 
the context of dynamic compilation and optimization. In such a context, profiling 
must not only provide information useful in a dynamic setting, but do so with 
low runtime overhead. Our strategy, TFP, can collect a range of time-sensitive, 
control-flow-based information which is more detailed than that collected
by block, edge or path profiling. Despite being more powerful, TFP’s additional
overheads are small (5.75% on average relative to Bala’s method [8] in our
experiments). Statistics gathered from the SPEC 2000 benchmarks
revealed further opportunities for profile-directed, flow-specific optimizations at
runtime. We also showed a case study that demonstrates the usefulness of the 
information collected by TFP for optimization at runtime. 

We also plan on incorporating TFP in the context of a dynamic compiler 
to further explore its usefulness and actual overheads. Moreover, the amount 
of persistence ( K ) needed at runtime to actually produce benefit should be 
explored. Work is also under way to find efficient ways of using TFP to gather the
exact path frequencies, if needed, at runtime. We plan to study if the definition
of persistence can be relaxed (to accommodate a larger range of information)
without adding to the overheads.

Abstract. To improve performance, data reorganization needs locality
models to identify groups of data that have reference affinity. Much past 
work is based on access frequency and does not consider accessing time 
directly. In this paper, we propose a new model of reference affinity. This 
model considers the distance between data accesses in addition to the 
frequency. Affinity groups defined by this model are consistent and have 
a hierarchical structure. The former property ensures the profitability of 
data packing, while the latter supports data packing for storage units 
of different sizes. We then present a statistical clustering method that 
identifies affinity groups among structure fields and data arrays by ana- 
lyzing training runs of a program. When used by structure splitting and 
array regrouping, the new method improves the performance of two test 
programs by up to 31%. The new data layout is significantly better than 
that produced by the programmer or by static compiler analysis. 



1 Introduction 

The widespread use of hierarchical memory on today’s PCs and workstations is 
based on the assumption that programs have locality. In the early days of virtual
memory design, Denning defined locality as “a concept that a program favors
a subset of its segments during extended intervals (phases)” and locality set as
“the set of segments needed in a given program phase” [9]. Locality set measures
the memory demand but does not suggest how to improve it. Abu-Sufah, working
with Kuck, used data dependence information to estimate program locality and
to reorder the program execution for better locality [1]. Thabit, working with
Kennedy, analyzed the access affinity among data elements and used data placement
to improve locality [32]. Subsequent research has examined a great number
of locality models and their use in computation reordering, data reordering, or 
their combination. 

In this paper we restrict our attention to locality models that are used in 
data transformation. Data placement improves memory performance by group- 
ing useful data into the same or adjacent cache blocks or memory pages. On 
today’s high-end machines from IBM, SUN, and companies using Intel Itanium 
and AMD processors, the largest cache in the hierarchy is composed of blocks
of no fewer than 64 bytes. If only one four-byte integer is useful in each 64-byte
cache block, about 94% of cache space would be occupied by useless data, and only
about 6% of cache is available for data reuse. A similar issue exists for memory pages, except
that the utilization problem can be much worse. By grouping useful data to- 
gether, data placement can significantly improve cache and memory utilization. 
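
As a quick check of the utilization figures quoted above, assuming exactly one useful four-byte integer in each 64-byte block:

```latex
\[
\text{useful fraction} = \frac{4}{64} = 6.25\% \approx 6\%,
\qquad
\text{wasted fraction} = 1 - \frac{4}{64} = 93.75\% \approx 94\%.
\]
```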

Data placement needs some model of reference affinity to tell which data 
are useful and should be grouped together. Past models are based on
access frequency. Thabit and many others used the frequency of data pairs, called
access affinity [32]. Chilimbi used the frequency of data “streams”, which are
subsequences of data accesses [4]. The frequency model does not consider the time of
the access. For example, suppose that a program executes in three phases that
frequently access three data pairs x and y, y and z, and z and x respectively. 
If we use only frequency information, we may group all three elements into the 
same cache block, although they are never used together. The problem becomes 
worse in grouping for larger storage units such as a memory page because the 
chance of grouping unrelated data is much greater. In 1999, Ding and Kennedy 
used a model to find arrays that are always accessed together [11]. However, the 
compiler-based model does not address locality in programs with general data 
and complex control flow. 

In this paper, we describe a new model of reference affinity. A set of data have 
reference affinity if they are always used together by a program. We say that 
they are in the same affinity group. This reference affinity model has two unique 
properties that are important for data placement. The first is consistency — the 
group of data elements are always accessed together. Placing group data in the 
same cache block always guarantees high space utilization. We will later define 
what we mean by “accessed together” and show how the consistency requirement 
can be relaxed to consider partial utilization of cache. 

The second property is that the model has a hierarchical structure. The 
largest group is the set of all program data, if we treat the entire execution as 
one unit of time. As we change the granularity of time, we find a decomposition of 
program data into groups of smaller sizes until the extreme case when each data 
element is a group. Hierarchical groups allow us to fully utilize cache hierarchy. 
An affinity group used in cache-block packing needs at most a dozen elements, 
while a group used for a memory page may need over one thousand elements. 
These two properties distinguish this reference affinity model from other existing 
models especially frequency-based models. 

The rest of this paper is organized as follows. We first define reference affinity 
and prove its consistency and hierarchical properties. We describe a new method 
for analyzing reference affinity at the source level and use it to improve cache 
utilization. This research is still in progress. We have not formulated all exten- 
sions of the basic concepts, nor have we evaluated our method on a broad class 
of programs or against alternative approaches. This is a preliminary report of 
our current findings.

2 Reference Affinity

This section first defines three preliminary concepts and gives two examples of 
our reference affinity model. Then it presents its formal definition and proves its 
properties including consistent affinity and hierarchical organization. 

An address trace or reference string is a sequence of accesses to a set of 
data elements. If we assign a logical time to each access, the address trace is 
a vector indexed by the logical time. We use letters such as $x, y, z$ to represent
data elements, subscripted symbols such as $a_x, a'_x$ to represent accesses to a
particular data element $x$, and array index such as $T[a_x]$ to represent the logical
time of an access $a_x$ on trace $T$.

The volume distance between two accesses, $a_x$ and $a_y$ ($T[a_x] < T[a_y]$), in
a trace $T$ is the number of distinct data elements accessed in times
$T[a_x], T[a_x]+1, \ldots, T[a_y]-1$. We write it as $dis(a_x, a_y)$. If $T[a_x] > T[a_y]$,
we let $dis(a_x, a_y) = dis(a_y, a_x)$. If $T[a_x] = T[a_y]$, $dis(a_x, a_y) = 0$. The
volume distance measures the volume of data accessed between two points of a
trace. It is in contrast with the time distance, which is the difference between
the logical time of two accesses. For example, the volume distance between the
accesses to a and c in trace abbbc is two because two distinct elements are
accessed in abbb. Given any three accesses in time order, $a_x$, $a_y$, and $a_z$, we have
$dis(a_x, a_z) \le dis(a_x, a_y) + dis(a_y, a_z)$,
because the cardinality of the union of two sets is no greater than the sum of
the cardinality of each set.
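
The definition can be made concrete with a short C sketch. It assumes, purely for illustration, that a trace is a character string whose index is the logical time and that each character names one data element; the function name volume_distance is hypothetical. This direct scan is not the fast analysis technique of [13].

```c
#include <stddef.h>

/* Volume distance between the accesses at logical times tx and ty:
   the number of distinct elements accessed at times
   min(tx,ty), ..., max(tx,ty) - 1. */
size_t volume_distance(const char *trace, size_t tx, size_t ty) {
    if (tx > ty) { size_t t = tx; tx = ty; ty = t; }  /* dis is symmetric */
    unsigned char seen[256] = {0};
    size_t distinct = 0;
    for (size_t t = tx; t < ty; t++) {
        unsigned char e = (unsigned char)trace[t];
        if (!seen[e]) { seen[e] = 1; distinct++; }
    }
    return distinct;
}
```

For the trace abbbc, the call volume_distance("abbbc", 0, 4) counts the elements a and b and yields 2, matching the example in the text.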

Mattson defined the volume distance between a pair of data reuses as LRU 
stack distance [23]. Volume distance can be measured in the same way as stack 
distance. Ding and Zhong recently gave a fast analysis technique that can measure
volume distance in traces with tens of billions of memory accesses to hundreds
of millions of data [13]. We use Ding and Zhong’s method in our experimental
study, which will be presented in Section 4. 

Based on the volume distance, we define a linked path on a trace. There
is a linked path from $a_x$ to $a_y$ ($x \neq y$) if and only if there exist $k$ accesses,
$a_{x_1}, a_{x_2}, \ldots, a_{x_k}$, such that (1) $dis(a_x, a_{x_1}) \le d \wedge dis(a_{x_1}, a_{x_2}) \le d \wedge
\ldots \wedge dis(a_{x_k}, a_y) \le d$ and (2) $x_1, x_2, \ldots, x_k$, and $x$ and $y$ are all different data
elements. In other words, a linked path is a sequence of accesses to different data
elements, and each link (between two consecutive members of the sequence) has
a volume distance no greater than $d$. We call $d$ the link length. We will later
restrict $x_1, x_2, \ldots, x_k$ to be members of some set $S$. If so, we say that there is
a linked path from $a_x$ to $a_y$ with link length $d$ for set $S$.
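
A candidate linked path can be checked mechanically, as in the C sketch below. It reuses the illustrative convention that a trace is a character string indexed by logical time; the helper vol_dist and the function is_linked_path are hypothetical names, and the path is supplied by the caller as an array of access times rather than searched for.

```c
#include <stddef.h>

/* Number of distinct elements accessed between two logical times. */
static size_t vol_dist(const char *trace, size_t tx, size_t ty) {
    if (tx > ty) { size_t t = tx; tx = ty; ty = t; }
    unsigned char seen[256] = {0};
    size_t distinct = 0;
    for (size_t t = tx; t < ty; t++) {
        unsigned char e = (unsigned char)trace[t];
        if (!seen[e]) { seen[e] = 1; distinct++; }
    }
    return distinct;
}

/* times[0..n-1] are the logical times of the accesses on the path,
   in path order. Returns 1 if (1) every link has volume distance
   at most d and (2) all accessed elements are different. */
int is_linked_path(const char *trace, const size_t *times, size_t n, size_t d) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (trace[times[i]] == trace[times[j]]) return 0;
    for (size_t i = 0; i + 1 < n; i++)
        if (vol_dist(trace, times[i], times[i + 1]) > d) return 0;
    return 1;
}
```

On the trace xwzzy, the accesses at times 0, 2 and 4 (to x, z and y) form a linked path with link length 2 but not with link length 1, because the link from x to z spans the two elements x and w.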

We now explain reference affinity with two example address traces in Fig. 1.
The “...” represents accesses to data elements other than w, x, y, and z.
In the first example, accesses to x, y, and z are in three time ranges. They have
consistent affinity because they are always accessed together. They belong to
the same affinity group. The consistency is important for data placement. For
example, if x and w are not always used together, then putting them into the
same cache block would waste cache space when only one of the two is accessed.
The example shows that finding this consistency is not trivial. The accesses to
the three data elements appear in different orders, with different frequency, and

xyz ... xwzzy ... yzvvvvvx ...
(1) The affinity group {x,y,z} with link length d = 2

wxwxuyzyz ... zyzyvwxwx ...
(2) The affinity group {w,x,y,z} at d = 2 becomes two groups {w,x} and {y,z} at d = 1

Fig. 1. Examples of reference affinity model and its properties



mixed with accesses to other data. However, one property holds in all three time 
ranges — the accesses to the three elements are connected by a linked path with 
a link length of at most 2. 

As we will prove later, affinity groups are parameterized by the link length d,
and for each d, they form a partition of program data. The second example in
Fig. 1 shows that group partition has a hierarchical structure for different link
lengths. The affinity group with the link length of 2 is {w,x,y,z}. If we reduce
the link length to 1, the two new groups will be {w,x} and {y,z}. The structure
is hierarchical with respect to the link length: groups at a smaller link length are 
subsets of groups at a greater link length. The hierarchical structure is useful 
in data placement because it may find different-sized affinity groups that match 
the capacity of the multi-level cache hierarchy. 

We now present the formal definition of reference affinity. 

Definition 1 Strict Reference Affinity. Given an address trace, a set $G$ of
data elements is a strict affinity group (i.e. they have reference affinity) with the
link length $d$ if and only if

1. for any $x \in G$, all its accesses $a_x$ must have a linked path from $a_x$ to
some $a_y$ for each other member $y \in G$, that is, there exist different
elements $x_1, x_2, \ldots, x_k \in G$ such that $dis(a_x, a_{x_1}) \le d \wedge dis(a_{x_1}, a_{x_2}) \le
d \wedge \ldots \wedge dis(a_{x_k}, a_y) \le d$

2. adding any other element to $G$ will make Condition (1) impossible to hold



The following theorem proves that strict affinity groups are consistent be- 
cause they form a partition of program data. In other words, each data element 
belongs to one and only one affinity group. 

Theorem 1 Given an address trace and a link length d, the affinity groups 
defined by Definition 1 form a unique partition of program data. 

Proof. We show that any element x of program data belongs to one and only one 
affinity group at a link length d. For the “one” part, observe that Condition (1) 
in Definition 1 holds trivially when x is the only member of a group. Therefore 
any element must belong to some affinity group.

We prove the “only-one” part by contradiction. Suppose $x$ belongs to $G_1$
and $G_2$ ($G_1 \neq G_2$). Then we can show that $G_3 = G_1 \cup G_2$ satisfies Condition
(1). For any two elements $y, z \in G_3$, if both belong to $G_1$ or both belong to $G_2$,
then Condition (1) holds.

Without loss of generality, assume $y \in G_1 \wedge y \notin G_2$ and $z \in G_2 \wedge z \notin
G_1$. Because $y, x \in G_1$, any $a_y$ must have a linked path to an $a_x$, that is,
there exist $y_1, \ldots, y_k \in G_1$ and an access $a_x$ such that $dis(a_y, a_{y_1}) \le d \wedge \ldots \wedge
dis(a_{y_k}, a_x) \le d$. Similarly, there is a linked path for this $a_x$ to an $a_z$ because
$x, z \in G_2$, that is, there exist $z_1, \ldots, z_m \in G_2$ and an access $a_z$ such that
$dis(a_x, a_{z_1}) \le d \wedge \ldots \wedge dis(a_{z_m}, a_z) \le d$.

If $y_1, \ldots, y_k \notin \{z_1, \ldots, z_m\}$, then there is a linked path from $a_y$ to some $a_z$.
Suppose $y_1, \ldots, y_{i-1} \notin \{z_1, \ldots, z_m\}$ but $y_i = z_t$. Then we have a linked path
from $a_y$ to $a_{y_i}$. Since $y_i = z_t \in G_2$, there is a linked path from $y_i$ to $z$, that is,
there exist $z'_1, z'_2, \ldots, z'_n \in G_2$ such that $dis(a_y, a_{y_1}) \le d \wedge \ldots \wedge dis(a_{y_{i-1}}, a_{y_i}) \le
d \wedge dis(a_{y_i}, a_{z'_1}) \le d \wedge \ldots \wedge dis(a_{z'_n}, a_z) \le d$. Now $y_i$ belongs to $G_1 \cap G_2$, just like $x$.
We have come back to the same situation except the linked path from $a_y$ to $a_{y_i}$
is shorter than the path from $a_y$ to $a_x$. We repeat this process. If $y_1, \ldots, y_{i-1} \notin
\{z'_1, \ldots, z'_n\}$, then we have a linked path from $a_y$ to $a_z$. Otherwise, there must
be $y_j \in \{z'_1, \ldots, z'_n\}$ for some $j < i$. The process cannot repeat forever because
each step shortens the path from $y$ to the next chosen access by this process. It
must terminate in a finite number of steps. We then have a linked path from $a_y$
to $a_z$ in $G_3$. Therefore, Condition (1) always holds for $G_3$. Since $G_1, G_2 \subset G_3$,
they are not the largest sets that satisfy Condition (1). Therefore, Condition (2)
does not hold for $G_1$ or $G_2$. A contradiction. Therefore, $x$ belongs to only one
affinity group, and affinity groups form a partition.

For a fixed link length, the partition is unique. Suppose more than one
partition can result from Definition 1; then some $x$ belongs to $G_1$ in one
partition and $G_2$ in another partition. As we have just seen, this is not possible
because $G_3 = G_1 \cup G_2$ satisfies Condition (1) and therefore neither $G_1$ nor $G_2$
is an affinity group.

As we just proved, reference affinity is consistent because all members will 
always be accessed together. The consistency means that packing data in an 
affinity group will always improve cache utilization. In addition, the group par- 
tition is unique because each data element belongs to one and only one group 
for a fixed d. The uniqueness removes any possible conflict, which would happen 
if a data element could appear in multiple affinity groups. 

Next we prove that strict reference affinity has a hierarchical structure — an 
affinity group with a shorter link length is a subset of an affinity group with 
a greater link length. 

Theorem 2 Given an address trace and two distances $d$ and $d'$ ($d < d'$), the
affinity groups at $d$ form a finer partition of affinity groups at $d'$.

Proof. We show that any affinity group at $d$ is a subset of some affinity group
at $d'$. Let $G$ be an affinity group at $d$ and $G'$ be the affinity group at $d'$ that
overlaps with $G$ ($G \cap G' \neq \emptyset$). Since any $x, y \in G$ are connected by a linked
path with link length $d$, they are connected by a linked path with a larger link
length $d'$. According to the proof of Theorem 1, $G \cup G'$ is an affinity group at $d'$.
$G$ must be a subset of $G'$; otherwise $G'$ is not an affinity group because it can
be expanded while still guaranteeing Condition (1).

Finally, we show that elements of the same affinity group are always accessed
together. When one element is accessed, all other elements will be accessed within 
a time range with a bounded volume distance. 

Theorem 3 Given an address trace with an affinity group G at link length d, 
any time an element x of G is accessed at a x , there exists a time range that 
includes a x and at least one access to all other members of G, and the volume 
distance of the time range is no greater than 2d|G| + 1, where |G| is the number 
of elements in the affinity group. 

Proof. According to Definition 1, for any $y$ in $G$, there is a linked path from $a_x$
to some $a_y$. Sort these accesses in time order. Let $a_w$ be the earliest and $a_v$ be
the latest in the trace. There is a linked path from $a_w$ to $a_x$. Let the sequence
be $a_{x_1}, a_{x_2}, \ldots, a_{x_k}$. The volume distance from $a_w$ to $a_x$ is $dis(a_w, a_x)$. It is no
greater than $dis(a_w, a_{x_1}) + dis(a_{x_1}, a_{x_2}) + \ldots + dis(a_{x_k}, a_x)$, which is $(k+1)d \le
|G|d$. The bound of the volume distance from $a_x$ to $a_v$ is the same. Considering
that $a_v$ needs to be included in the time range, the total volume distance is at
most $2d|G| + 1$.

The strict affinity requires that members of an affinity group are always 
accessed together. In many cases, a group of data may often be accessed together 
but not always. We can relax the first condition to require a group member
to be accessed with other members k% of the time instead of all the time. The 
formal definition is below. The only change is the first condition. 

Definition 2 Partial Reference Affinity Given an address trace, a set $G$ of
data elements is a partial ($k\%$) affinity group with the link length $d$ if and only
if

1. for any $x \in G$, at least $k\%$ of the accesses $a_x$ have a linked path from $a_x$ to
   some $a_y$ for each other member $y$ in $G$

2. adding any other element to $G$ will make Condition (1) impossible to hold



Partial affinity groups do not produce a unique partition of program data, and the
structure is not strictly hierarchical. The loss in consistency and organization
depends on k. As on-going work, we are quantifying the bound of this
loss as a function of k.
3 Clustering Analysis

The purpose of clustering analysis is to identify affinity groups. Data elements 
that tend to be accessed simultaneously should be clustered into the same group. 
We use k-means and its extension, x-means, to statistically measure the similar- 
ity of the reuse behavior of individual data elements and do the clustering. 

K-means is a popular statistical clustering algorithm. It was first proposed by
MacQueen [22] in 1967. The optimization criterion implied in k-means is the
sum-of-squares criterion [16]: the aim is to minimize the total within-group sum of
squares. The basic idea of the algorithm is an iterative regrouping of objects
until a local minimum is reached [18]. A sketch of the algorithm is as follows:

1. Initialize with arbitrarily selected k centroids for k groups;

2. Assign each object to the closest centroid;

3. For each group, adjust the centroid to be the point denoted by the mean of
   all objects assigned to that group;

4. If there are any changes in step 2 or 3, go to step 2; otherwise, stop.
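The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tool used in the experiments: the function name `kmeans`, the squared Euclidean distance, the random choice of initial centroids and the iteration cap are all assumptions made for the example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means; returns (assignment, centroids)."""
    rng = random.Random(seed)
    # Step 1: pick k arbitrary objects as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Step 2: assign each object to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop when no object changes its group.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a group is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical data: two well-separated pairs of 2-D points.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
assignment, centroids = kmeans(points, k=2)
```

As the text notes, the result is only a local minimum and depends on the initial centroids, which is one reason several values of k and several runs may be compared.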

One problem of k-means is that the value of k needs to be specified at the
beginning. For our affinity analysis, we may not know the optimal number of
groups in the first place. Therefore, we also apply an extension of k-means,
x-means [27], in our analysis. X-means relies on BIC (Bayesian Information
Criterion) to compare the clusterings formed for different values of k. Based on the BIC
calculation, it approximately measures the probability of each clustering given
the original data set. The one with the highest probability is then chosen.

According to the definition given in Section 2, an accurate way to identify 
affinity groups would be recording and comparing the reference trace of each 
data element. However, the time and space overhead of this approach would 
be high. Therefore, we propose an approximate estimation of affinity groups 
according to the reuse distance distribution of individual data elements. Reuse 
distance is equivalent to volume distance between two adjacent reuses to the 
same datum. We use the efficient measurement described in [13] to collect reuse 
distance information. For an array, we do not distinguish references to different 
array elements but view the whole array as a single data element. For a structure, 
we consider the accumulated reuse distance distributions of the accesses to each 
structure field of all instances of the structure. For example, a tree structure 
composed of two fields left and right will be considered two data elements. No 
matter how many objects of this structure type will be dynamically allocated, 
the references to the first field of these objects will always be counted as the 
accesses to the first data element. The same rule is applied to the references to 
the second field of allocated objects. 
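To make the measured quantity concrete, the sketch below computes reuse distances from a trace of data element names. It assumes that the reuse distance of an access is the number of distinct elements accessed strictly between it and the previous access to the same element. It is a naive quadratic-time illustration, not the efficient measurement of [13], and the function name `reuse_distances` is hypothetical.

```python
def reuse_distances(trace):
    """Return (element, reuse distance) for every reuse in the trace."""
    last_pos = {}   # element -> index of its most recent access
    result = []
    for i, elem in enumerate(trace):
        if elem in last_pos:
            # Distinct elements accessed since the previous access to elem.
            between = set(trace[last_pos[elem] + 1:i])
            result.append((elem, len(between)))
        last_pos[elem] = i
    return result


# Accesses to structure fields are attributed to the field name, regardless
# of which dynamically allocated object is touched (hypothetical trace).
trace = ["left", "right", "key", "left", "right"]
distances = reuse_distances(trace)
```

Collecting the distances per element name gives the reuse distance distribution of that element.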

For any datum, the whole reuse distance scope for one execution is from
zero to the maximal reuse distance that occurs in the execution. We divide this
scope into a set of ranges, each in the form of $[d_1, d_2)$, where $d_1, d_2$ are two reuse
distance values ($d_1 < d_2$). Then for each range, we count the number of references
whose reuse distance falls into that range and calculate the average distance of
Table 1. Example of statistics collected for each data element and used in clustering

Range No.  Range scope   Reuse distance set        No. reuses  Avg. reuse distance
1          [2048, 4096)  {2500, 3000, 3500, 4000}  4           3250
2          [4096, 6144)  {4500, 5000, 5500, 6000}  4           5250
3          [6144, 8192)  {6500, 7000}              2           6750

these reuses. The set composed of the number of references within each range
forms a counter vector. The set of the average distances calculated for each range
forms a distance vector. These two vectors are collected for every data element
that we target to be grouped. Each describes the reuse behavior of the
corresponding data element by locating it in an N-dimensional space, where N
is the number of reuse distance ranges considered. Since references with a long
reuse distance have a more significant effect on performance, we emphasize
such references in clustering analysis. For all experiments reported in this paper,
only references with a reuse distance longer than 2048 are used in clustering and
the reuse distance ranges are divided linearly with a constant length of 2048.
In other words, the reuse distance ranges we consider in clustering analysis
begin with [2048, 4096) and go on with [4096, 6144), [6144, 8192), and so forth.

An example with the above setting is given in Table 1. Suppose there
are 10 reuses to the left field of the instances of a tree structure. Their
reuse distances are (sorted incrementally): {2500, 3000, 3500, 4000, 4500, 5000,
5500, 6000, 6500, 7000}. We will construct 3 reuse distance ranges for this field
and Table 1 describes the statistics collected for each range. The last two columns
of the table list the two vectors to be clustered as (4, 4, 2) and (3250, 5250, 6750)
respectively.
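The construction of the two vectors can be reproduced with a short sketch. The function name `range_vectors` and its parameters are hypothetical; the range length and the 2048 threshold follow the setting described above, and distances below the first range are simply dropped.

```python
def range_vectors(distances, width=2048, start=2048):
    """Build the counter vector and the distance vector for one data element.

    Range i covers [start + i*width, start + (i+1)*width).
    """
    kept = [d for d in distances if d >= start]
    if not kept:
        return [], []
    n_ranges = (max(kept) - start) // width + 1
    counts = [0] * n_ranges
    totals = [0] * n_ranges
    for d in kept:
        idx = (d - start) // width
        counts[idx] += 1
        totals[idx] += d
    # Average reuse distance per range; an empty range is reported as 0.0.
    averages = [t / c if c else 0.0 for t, c in zip(totals, counts)]
    return counts, averages


# The ten reuse distances of the left field from the example above.
left = [2500, 3000, 3500, 4000, 4500, 5000, 5500, 6000, 6500, 7000]
counter_vector, distance_vector = range_vectors(left)
# counter_vector  -> [4, 4, 2]
# distance_vector -> [3250.0, 5250.0, 6750.0]
```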

Our overall algorithm is shown in Fig. 2. 



algorithm Clustering Analysis
  input: V-Data: reuse distance distribution vectors for each data element
  output: optimal data grouping
begin
  Apply x-means on V-Data; K = No. of clusters recommended by x-means;
  for (i = K-1; i <= K+1; i++)
    Apply k-means on V-Data with i as the specified No. of clusters;
  end for
  Try training runs with all groupings found by x-means and k-means;
  Compare performance of training runs and find the optimal data grouping;
end
end algorithm

Fig. 2. Clustering analysis for data grouping
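The control flow of Fig. 2 can be sketched as below. The three callables `run_xmeans`, `run_kmeans` and `training_time` are hypothetical stand-ins for the clustering tool and for a timed training run; only the selection logic is illustrated, and the stub functions in the example return made-up values.

```python
def choose_grouping(v_data, run_xmeans, run_kmeans, training_time):
    """Pick the grouping with the shortest training run.

    A grouping is a list of groups, each group a tuple of element names.
    """
    x_grouping = run_xmeans(v_data)
    candidates = [x_grouping]
    k_rec = len(x_grouping)              # K recommended by x-means
    for k in range(max(1, k_rec - 1), k_rec + 2):   # K-1, K, K+1
        candidates.append(run_kmeans(v_data, k))
    return min(candidates, key=training_time)


# Stubs standing in for the real tools (all values are invented).
def fake_xmeans(v_data):
    return [("a", "b"), ("c",)]

def fake_kmeans(v_data, k):
    return {
        1: [("a", "b", "c")],
        2: [("a",), ("b", "c")],
        3: [("a",), ("b",), ("c",)],
    }[k]

def fake_time(grouping):
    seconds = {
        (("a", "b"), ("c",)): 3.0,
        (("a", "b", "c"),): 5.0,
        (("a",), ("b", "c")): 2.0,
        (("a",), ("b",), ("c",)): 4.0,
    }
    return seconds[tuple(grouping)]

best = choose_grouping({}, fake_xmeans, fake_kmeans, fake_time)
```

Because the final choice is made by measured running time, the clustering only needs to narrow the search to a few candidate groupings.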
4 Evaluation

In this section, we measure the effect of the affinity group clustering by reor- 
ganizing data layout according to clustering results and comparing performance 
changes. 



Two Test Programs We test on two different programs: Cheetah and Swim.
Cheetah is a cache simulator written in C and included in the SimpleScalar 3.0 suite.
It constructs a splay tree to store and maintain the cache content. The tree
structure of Cheetah is composed of seven fields. In our experiments, we check
the reuse behavior of each individual field within the tree structure and cluster
the fields into groups. According to the different clusterings, we implement different
versions of structure splitting on the source file and compare their performance.
Swim from SPEC95 calculates finite difference approximations for the shallow water
equations. The reuse distance distributions of the fourteen arrays of real type in
this Fortran program are collected and used in clustering. Then, the source file
is changed by merging arrays clustered in the same group into a single array.
Again, the performance of the different versions is compared. These experiments
also explore the potential uses and benefits of data clustering based on locality
behaviors.



Clustering Methods We use the k-means and x-means analysis tool implemented
by Pelleg and Moore at Carnegie Mellon University [27]. Each data
object to be clustered is represented by a set of feature values, each collected for
a given reuse distance range. Two types of features are considered: the number
of reuses and the average reuse distance. The reuse distance ranges have a fixed
length of 2048, as described in Section 3.



Platforms The experiments are performed on three different machines: a
250MHz MIPS R10K processor with the SGI MIPSpro compiler, a Sun Sparc
U4500/336 processor, and a 2GHz Intel Pentium IV processor with the Linux gcc
compiler. All programs are compiled with optimization flags -n32 -Ofast or -O3,
respectively.



Structure Splitting Table 2 lists the clustering results for the tree structure in
Cheetah. The input to the Cheetah simulator is an access trace from JPEG encoding
of images ranging in size from dozens to thousands of bytes.

Although the tree structure consists of seven fields, rtwt, rt, lft, inum, addr,
grpno and prty, the table lists only the first five of them. The reason is that the
simulator for fully associative LRU cache accesses only the first five fields, and
we apply clustering only to them. Clustering the other two is trivial. The first
column of the table gives the size of the encoded image. Columns 2 and 3 describe
how the clustering is applied. The fourth column contains the number of
clusters identified, while the last column lists the clustering result. While k-means
Table 2. Clustering for tree structure of Cheetah. K-means gives results for 2 to 4
clusters; we show only 2 and 3

Input image size  Clustering data      Clustering method  No. clusters  Clustering
43 bytes          No. reuses           k-means            3             (addr) (inum rt) (lft rtwt)
                                                          2             (addr) (inum lft rt rtwt)
                                       x-means            3             (addr inum rt) (lft) (rtwt)
                  Avg. reuse distance  k-means            3             (addr) (inum lft rtwt) (rt)
                                                          2             (addr) (inum lft rt rtwt)
                                       x-means            3             (addr) (inum lft rtwt) (rt)
969 bytes         No. reuses           k-means            3             (addr) (inum lft rtwt) (rt)
                                                          2             (addr) (inum lft rt rtwt)
                                       x-means            3             (addr) (inum lft rtwt) (rt)
                  Avg. reuse distance  k-means            3             (addr) (inum lft rtwt) (rt)
                                                          2             (addr) (inum lft rt rtwt)
                                       x-means            3             (addr) (inum lft rtwt) (rt)
2.58K bytes       No. reuses           k-means            3             (addr) (inum lft rtwt) (rt)
                                                          2             (addr) (inum lft rt rtwt)
                                       x-means            3             (addr) (inum lft rtwt) (rt)
                  Avg. reuse distance  k-means            3             (addr) (inum) (lft rt rtwt)
                                                          2             (addr inum) (lft rt rtwt)
                                       x-means            3             (addr) (inum) (lft rt rtwt)

Table 3. Different structure splitting for Cheetah

Version No.  Grouping
orig         (addr inum grpno lft prty rt rtwt)
v1           (addr) (inum) (grpno) (lft) (prty) (rt) (rtwt)
v2           (addr) (inum) (grpno) (lft rtwt) (prty) (rt)
v3           (addr) (inum) (grpno) (lft) (prty) (rt rtwt)
v4           (addr) (inum lft rt rtwt) (grpno) (prty)



gives results for two to four clusters, we only show groupings with two and three 
clusters here. 

The clustering results shown in Table 2 have two important features. First,
the clustering on the five fields varies across different inputs or different clustering
algorithms. Second, although there is no single winner, the clustering indicates
a strong affinity among rtwt, rt, lft and inum. Therefore, we choose to reorganize
the tree structure by grouping these four fields in different ways. Table 3 lists
the structure splittings we tested.

Each row of Table 3 describes a grouping of the seven fields of the tree structure.
Version orig is an array-based version of the original Cheetah. In orig, the
tree nodes are contained in a big pre-allocated array instead of being dynamically
allocated at run time. This array-based Cheetah simulator is faster than the original
Cheetah (over 10% faster when tested on an SGI processor). All other versions
modify version orig to implement structure splitting. Version v1 divides the structure
into seven groups, each containing a single field. The other three versions group
fields in different ways according to the similarity measured by the clustering analysis.
We change the source files by hand to get the different versions. The access trace
of JPEG encoding a 272 KB test image is used as the testing input.
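The effect of a splitting on memory layout can be illustrated with Python's ctypes module, which lays out structures the way a C compiler would. This is only a sketch under assumptions: the helper `split_layout` is hypothetical, every field is taken to be a 32-bit integer (the real field types in Cheetah are not given here), and the grouping shown is version v4 of Table 3.

```python
import ctypes


def split_layout(grouping, n_nodes, ctype=ctypes.c_int32):
    """Allocate one pre-allocated array of small structs per field group.

    Returns (field_group, arrays): field_group maps a field name to the index
    of the array that holds it; node i's field f lives in
    arrays[field_group[f]][i].f.
    """
    field_group = {}
    arrays = []
    for g, group in enumerate(grouping):
        fields = [(name, ctype) for name in group]

        class Group(ctypes.Structure):
            _fields_ = fields        # fields of one group are adjacent in memory

        arrays.append((Group * n_nodes)())
        for name in group:
            field_group[name] = g
    return field_group, arrays


# Version v4: (addr) (inum lft rt rtwt) (grpno) (prty)
V4 = [("addr",), ("inum", "lft", "rt", "rtwt"), ("grpno",), ("prty",)]
field_group, arrays = split_layout(V4, n_nodes=8)

# Writing and reading field rt of node 3 goes through the group that holds rt.
arrays[field_group["rt"]][3].rt = 42
```

With this layout, the four fields with strong affinity share one array of 16-byte records, while the remaining fields no longer occupy space in the cache blocks fetched for them.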
Table 4. Performance of different structure splittings for Cheetah

Version No.                     2GHz Intel Pentium IV  250MHz MIPS R10K  336MHz UltraSparc II
orig                            23.4s                  59.7s             93.7s
v1                              23.7s                  71.9s             91.7s
v2                              20.7s                  59.2s             91.2s
v3                              20.1s                  58.2s             92.5s
v4                              19.3s                  56.0s             89.1s
Best improvement over orig (%)  17.5                   6.20              4.93
Best improvement over v1 (%)    18.6                   22.1              2.84



Table 5. Clustering for arrays of Swim. K-means gives results for 2 to 13 clusters;
we show only 7 and 8

Clustering data      Clustering method  No. of clusters  Clustering                                                      Version No.
No. of reuses        k-means            8                (cu h) (cv z) (p u v) (pnew unew) (pold uold) (psi) (vnew) (vold)  v1
                                        7                (cu cv h z) (p) (pnew unew vnew) (pold uold vold) (psi) (u) (v)    v2
                     x-means            8                (cu cv h z) (p u v) (pnew) (pold uold) (psi) (unew) (vnew) (vold)  v3
Avg. reuse distance  k-means            7                (cu cv h z) (p) (pnew unew vnew) (pold uold vold) (psi) (u) (v)    v2
                     x-means            8                (cu cv h z) (p) (pnew unew vnew) (pold uold) (psi) (u) (v) (vold)  v4
static analysis                                          (cu cv) (h z) (p) (pnew unew vnew) (pold uold vold) (psi) (u v)    static



Different versions of Cheetah are compiled and run on the three platforms
described above and the running times are collected by the standard time utility.
Table 4 summarizes the experimental results.

For each version listed, Table 4 gives the execution time on the three platforms.
The user time reported by the time command is used as the execution time.
The comparison between the first and second rows of the table shows there is no
clear benefit from simply dividing the original structure into individual fields.
Version v1 runs slower than orig on both the MIPS and Pentium machines. However,
by grouping the fields with similar reuse behavior together, the other versions
almost always gain performance. Version v4 is consistently the best
among all the groupings. It is up to 17.5% faster than the original version. This
shows that the clustering analysis based on reuse distance distribution is effective in
identifying affinity groups.

Array Grouping Swim has fourteen arrays with the same size. We apply 
clustering analysis on these arrays and merge arrays in the same cluster. Table 5 
describes a subset of the clustering results. We tested the performance of these 
groupings. The training data for clustering analysis was collected by running 
Swim with an input matrix of size 32 x 32. 

The last row of Table 5 includes an array grouping based on static analysis. 
It was obtained by compiler analysis [11]. One restriction of array merging is the 
original arrays must have exactly the same size in all dimensions. This can be 
checked manually or by a compiler. The transformation process to get different 
Table 6. Performance of different array groupings for Swim

Version No.                       2GHz Intel Pentium IV  250MHz MIPS R10K  336MHz UltraSparc II
orig                              72.7s                  156.7s            268.1s
static                            57.5s                  155.3s            239.6s
v1                                59.8s                  153.9s            232.5s
v2                                50.1s                  156.0s            226.7s
v3                                54.1s                  155.3s            221.4s
v4                                50.1s                  145.8s            226.7s
Best improvement over orig (%)    31.1                   6.96              22.5
Best improvement over static (%)  12.9                   6.12              13.2



grouping versions at source-level is semi-automatic. We tested Swim for an input 
matrix of size 512 x 512. Table 6 gives the execution time for the original Swim 
and all the five grouping versions. 
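Array merging itself can be sketched as follows. The sketch assumes that merging interleaves the arrays of one group element by element, so that elements with the same index become adjacent in memory; the function names are hypothetical and plain Python lists stand in for the Fortran arrays.

```python
def merge_arrays(arrays):
    """Interleave equally sized arrays into one merged array."""
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        # The restriction stated above: all arrays must have the same size.
        raise ValueError("arrays in a group must have the same size")
    g = len(arrays)
    merged = [None] * (n * g)
    for j, a in enumerate(arrays):
        merged[j::g] = a          # element i of array j goes to slot i*g + j
    return merged


def element(merged, n_arrays, j, i):
    """Read element i of original array j from the merged array."""
    return merged[i * n_arrays + j]


# Hypothetical group of two small arrays, e.g. (pold uold).
pold = [1.0, 2.0, 3.0]
uold = [10.0, 20.0, 30.0]
merged = merge_arrays([pold, uold])
```

After merging, a loop that reads both arrays at the same index touches one contiguous region instead of two separate ones.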

Table 6 shows that the array groupings identified by clustering analysis
outperform the grouping based on static analysis most of the time. Version v4,
identified as the optimal clustering by the x-means method, is the best one on all
machines. It reduces the execution time by up to 31.1% compared to the original
version and up to 13.2% compared to the static analysis version.



5 Related Work 

An effective method for fully utilizing cache is to make data access contiguous. In- 
stead of rearranging data, the early studies reordered loops so that the innermost 
loop traverses data contiguously within each array. Various loop permutation 
schemes were studied for perfect loop nests or loops that can be made perfect, 
including those by Abu-Sufah et al. [2], Gannon et al. [15], Wolf and Lam [33], 
and Ferrante et al. [14]. McKinley et al. developed an effective heuristic that
permutes loops into memory order for both perfect and non-perfect loop nests [24].
Loop reordering, however, cannot always achieve contiguous data traversal
because of data dependences. This observation led Cierniak and Li to combine
data transformation with loop reordering [6]. Kremer developed a general
formulation for finding the optimal data layout, either static or dynamic, for
a program; although the problem is NP-hard, he showed that it is
practical to use integer programming to find an optimal solution [20]. Computation
reordering is powerful when applicable. However, in many programs, not
all data accesses can be made contiguous.

Alternatively, we can pack data that are used together. Early studies used 
the frequency of data access, measured by sample- and counter-based profiling 
by Knuth [21] and static probability analysis by Cocke and Kennedy [7] and by 
Sarkar [29]. Frequency information is frequently used in data placement, as we 
reviewed in the introduction. In addition, Chilimbi et al. split Java classes based 
on the access frequency of class members [5]. In addition to packing for cache 
blocks, Seidel and Zorn packed dynamically allocated data in memory pages [30].
Access frequency does not distinguish the time of access: that a pair or a group 
of data are frequently accessed does not mean that they are frequently accessed 
together, and that a group of data are accessed together more often than other 
data does not mean the data group are accessed together always or most of 
the time. In a recent paper, Petrank and Rawitz formalized this observation 
and proved a harsh bound: with only pair-wise information, no algorithm can 
find a static data layout that is always within a factor of $k - 3$ from the optimal
solution, where k is proportional to the size of cache [28]. Unlike reference affinity, 
the frequency-based models do not find data groups that are always accessed 
together. Neither do they partition data in a hierarchy based on their access 
pattern. 

Eggers and Jeremiassen grouped data fields that were accessed by a parallel 
thread to reduce false sharing [19]. Ding and Kennedy regrouped Fortran arrays 
that are always accessed together to improve cache utilization [11]. Ding and 
Kennedy later extended it to group high-dimensional data at multiple granular- 
ity [ 2]. While the previous studies used compiler analysis, this work generalizes 
the concept of reference affinity to address traces. It also proposes a new profiling- 
based method for finding affinity groups among source-level data. Preliminary 
results show that the new method out-performs the data layout given by either 
the compiler or the programmer. 

The above data packing methods are static and therefore cannot fully op- 
timize dynamic programs whose data access pattern changes during execu- 
tion. Dynamic data placement was first studied under an inspector-executor 
framework [8]. Al-Furaih and Ranka examined graph-based clustering of irreg- 
ular data for cache [3]. Other models include consecutive packing by Ding and 
Kennedy [10], space-filling curve by Mellor-Crummey et al. [25], graph parti- 
tioning by Han and Tseng [17], and bucket sorting by Mitchell et al. [26]. Several
studies found that consecutive packing compared favorably with other
models [25, 31].

6 Summary 

We have defined a new reference affinity model and proved its three basic prop- 
erties: consistent groups, hierarchical organization, and bounded reference dis- 
tance. We have described a clustering method to identify affinity groups among 
source-level structure fields and data arrays. The method uses data reuse statis- 
tics collected from training runs. It uses k-means and x-means clustering al- 
gorithms as a sub-procedure and explores a smaller number of choices before 
determining the reference affinity. When used by structure splitting and array 
grouping, the new method reduces execution time by up to 31%. It outperforms 
previous compiler analysis by up to 13%. As on-going work, we are formulat- 
ing partial reference affinity, studying more accurate ways of reference affinity 
analysis, and exploring other uses of this locality model in program optimization. 
Abstract. The wide use of multiprocessor systems has made automatic
parallelizing compilers more important. To further improve the performance
of multiprocessor systems by compiler, multigrain parallelization
is important. In multigrain parallelization, coarse grain task
parallelism among loops and subroutines and near fine grain parallelism
among statements are used in addition to the traditional loop parallelism.
In addition, locality optimization to use cache effectively is also important
for performance improvement. This paper describes inter-array
padding to minimize cache conflict misses among macro-tasks with a data
localization scheme which decomposes loops sharing the same arrays to
fit cache size and executes the decomposed loops consecutively on the
same processor. In the performance evaluation on Sun Ultra 80 (4pe),
OSCAR compiler, on which the proposed scheme is implemented, gave
2.5 times speedup against the maximum performance of Sun Forte
compiler automatic loop parallelization on average over the SPEC CFP95
tomcatv, swim, hydro2d and turb3d programs. Also, OSCAR compiler
showed 2.1 times speedup on IBM RS/6000 44p-270 (4pe) against XLF
compiler.



1 Introduction 

Multiprocessor architectures are currently used in a wide range of computers,
including high performance computers, entry level servers and games embedding
chip multiprocessors. To improve the usability and effective performance of
multiprocessor systems, automatic parallelizing compilers are required. To this end,
automatic parallelizing compilers have been researched. For example, the Polaris [1]
compiler exploits loop level parallelism by using symbolic analysis, runtime data
dependence analysis, range test and so on. Loop parallelization considering
data locality optimization using unimodular transformation, affine partitioning
and so on has been researched in the SUIF compiler [2].

Since various kinds of loops can be parallelized by those advanced compilers, 
to further improve the effective performance of multiprocessor systems, the use 
of different grains of parallelism such as the use of coarse grain task parallelism 
among loops and subroutines and fine grain parallelism among statements and 
instructions in addition to loop level parallelism should be considered. NANOS 
compiler [3] uses multi level parallelism by using the extended OpenMP API.
PROMIS compiler [4] integrates loop level parallelism and instruction level
parallelism using a common intermediate language. Multigrain parallel processing,
which has been realized in OSCAR compiler [5] and APC compiler (Advanced
Parallelizing Compiler, developed by the Japanese Millennium Project IT21) [6], uses coarse
grain task parallelism among loops and subroutines and near fine grain
parallelism among statements.

Also, optimization for the memory hierarchy is important to improve performance,
because memory access overhead grows as processors become faster.
Loop restructurings such as loop permutation, loop
fusion and tiling, which change the data access pattern in a loop, have been researched as
cache optimizations by the compiler. Data layout transformations including strip
mining and array permutation to make the data access pattern contiguous have also
been researched. Intra-array padding and inter-array padding to reduce conflict misses
in a single loop or a fused loop have been proposed [7]. Also, a loop fusion scheme
using peeling and shifting of loop iterations to allow fusion and maintain loop
parallelism has been used to enhance data locality [8]. Furthermore, after loop
fusion, conflict misses can be reduced by cache partitioning [9].

The performance of a physically-indexed cache depends on the page placement
policy of the operating system, such as page coloring and bin hopping [10]. A runtime
recoloring scheme using extended hardware such as a Cache Miss Lookaside
buffer to trace cache conflict misses has been proposed [11]. Low overhead
recoloring using an extended TLB to record the cache color has also been researched [12]. In
addition to these approaches requiring extended hardware, an OS and compiler
cooperative page coloring scheme without hardware extension, using information
on the access pattern of the program provided by the compiler, has been proposed [13].

This paper proposes a padding scheme that reduces conflict misses to improve
the performance of coarse grain task parallel processing. In the cache
optimization for coarse grain task parallel processing [14], the compiler first divides
loops into smaller loops so that the data size accessed by each loop fits the cache size. Next,
the compiler analyzes parallelism among tasks including the divided loops using
Earliest Executable Condition analysis and schedules tasks which share the
same data to the same processor so that the tasks can be executed consecutively
accessing the shared data on the cache. After that, cache line conflict misses
among tasks which are executed consecutively are reduced by the padding proposed
in this paper. Although ordinary cache optimizations by the compiler target
a single loop or a fused loop, the proposed scheme optimizes cache performance
over loops.

The rest of this paper is organized as follows. In section 2, the coarse grain 
task parallel processing is described. Section 3 describes the cache optimiza- 
tion scheme using data localization for the coarse grain task parallel processing. 
Section 4 proposes the padding scheme to reduce conflict misses over loops. 
The effectiveness of the proposed schemes is evaluated on the commercial mul- 
tiprocessors using several benchmarks in SPEC CFP95 in section 5. Finally, 
concluding remarks are described in section 6. 
Fig. 1. An Example of Macro-Task Graph



2 Coarse Grain Task Parallel Processing 

This section describes coarse grain task parallel processing to which the proposed
cache optimization scheme is applied. In coarse grain task parallel processing,
a source program is decomposed into three kinds of coarse grain tasks,
or macro-tasks, namely block of pseudo assignment statements (BPA), repetition
block (RB), and subroutine block (SB). Also, macro-tasks are generated hierarchically
inside a sequential repetition block and a subroutine block.



2.1 Generation of Macro-Task Graph 

After the generation of macro-tasks, the compiler analyzes data flow and control
flow among macro-tasks in each layer or each nested level. Next, to extract
parallelism among macro-tasks, the compiler analyzes the Earliest Executable
Condition (EEC) [5] of each macro-task.

EEC represents the conditions on which a macro-task may begin its execution
earliest.

The EEC of a macro-task is represented in a macro-task graph (MTG) as shown in
Fig. 1. In a macro-task graph, nodes represent macro-tasks. A small circle inside
a node represents a conditional branch. Solid edges represent data dependencies.
Dotted edges represent extended control dependencies. An extended control
dependency means an ordinary control dependency together with the condition on which a data
dependent predecessor macro-task is not executed. A solid arc indicates that
the edges connected by the arc are in an AND relationship. A dotted arc indicates
that the edges connected by the arc are in an OR relationship.

2.2 Macro-Task Scheduling 

In the coarse grain task parallel processing, static scheduling and dynamic 
scheduling are used for assignment of macro-tasks to processors. 
If a macro-task graph has only data dependencies and is deterministic, static
scheduling is selected. In the static scheduling, assignment of macro-tasks to pro- 
cessors is determined at compile time by the scheduler in the compiler. Static 
scheduling is useful since it allows us to minimize data transfer and synchroniza- 
tion overhead without runtime scheduling overhead. 

If a macro-task graph has control dependencies, dynamic scheduling is selected
to cope with runtime uncertainties like conditional branches. Scheduling
routines for dynamic scheduling are generated by the compiler and embedded into
a parallelized program with macro-task code.

2.3 Code Generation 

OSCAR compiler has several backends and generates the parallelized code for 
multiple target architectures. In this paper, OpenMP backend is used to generate 
OpenMP FORTRAN from sequential FORTRAN. OSCAR compiler generates 
the portable code for various shared memory multiprocessors by using “one-time 
single code generation” technique [5, 15]. Furthermore, by using native compiler 
as the backend of OSCAR compiler, general optimizations and machine specific
optimizations provided by it are applied to the generated code. Therefore,
OSCAR compiler can be used as a performance booster of the
native compiler on state of the art multiprocessors.



3 Cache Optimization for Coarse Grain Task Parallel 
Processing 

If macro-tasks that access the same data are executed consecutively on the same
processor, shared data can be transferred among these macro-tasks using fast
memory such as cache. This section describes cache optimization using data
localization [16] to enhance the performance of coarse grain task parallel
processing.



3.1 Loop Aligned Decomposition 

To avoid cache misses among the macro-tasks, Loop Aligned 
Decomposition(LAD)[16] is applied to loops that access large size data. 
LAD divides a loop into partial loops with the smaller number of iterations so 
that data size accessed by the divided loops is smaller than cache size. 

Partial loops are treated as coarse grain tasks and the Earliest Executable 
Condition(EEC) analysis is applied. 

Partial loops connected by data dependence edge on the macro task graph are 
grouped into “Data Localization Group(DLG)”. Partial loops, or macro-tasks, 
inside a DLG are assigned to the same processor as consecutively as possible by 
static or dynamic scheduler. 

In macro-task graph of Fig. 2(a), it is assumed that macro-tasks 2, 3 and 7 are 
parallel loops and they access the same shared variables and their size exceeds 
Fig. 2. Example of Loop Aligned Decomposition



cache size. In this example, loops are divided into four partial loops by the LAD.
For example, macro-task 2 in Fig. 2(a) is divided into macro-task 2_A through
2_D in Fig. 2(b). Also, DLGs are defined, for example, 2_A, 3_A, 7_A are grouped
into DLG_A.
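The decomposition and grouping in this example can be sketched as below. The sketch is a simplification under stated assumptions: all loops share one iteration range, each partial loop accesses a proportional share of the data, and partial loops with the same label are data dependent. The function name `decompose` and the data and cache sizes are hypothetical, and the real LAD also aligns partial loops according to inter-loop data dependence.

```python
import string


def decompose(loop_ids, n_iters, data_bytes, cache_bytes):
    """Split loops into partial loops and group them into DLGs.

    Returns (bounds, dlgs): bounds maps a label to the half-open iteration
    range of the partial loops with that label; dlgs maps a label to the
    names of the partial loops in that Data Localization Group.
    """
    # Smallest number of parts whose per-part data is smaller than the cache.
    n_parts = data_bytes // cache_bytes + 1
    if n_parts > len(string.ascii_uppercase):
        raise ValueError("too many partial loops for single-letter labels")
    labels = string.ascii_uppercase[:n_parts]
    bounds = {
        lab: (p * n_iters // n_parts, (p + 1) * n_iters // n_parts)
        for p, lab in enumerate(labels)
    }
    dlgs = {lab: [f"{loop}_{lab}" for loop in loop_ids] for lab in labels}
    return bounds, dlgs


# Macro-tasks 2, 3 and 7 with assumed sizes: 13 MiB of data, 4 MiB cache.
bounds, dlgs = decompose([2, 3, 7], n_iters=1024,
                         data_bytes=13 * 2**20, cache_bytes=4 * 2**20)
# dlgs["A"] -> ["2_A", "3_A", "7_A"]
```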

3.2 Scheduling for Consecutive Execution of Macro-Tasks 

Macro-tasks are executed in the increasing order of the node number on the
macro-task graph in the original program. For example, the execution order of
macro-tasks 2_A to 3_D is 2_A, 2_B, 2_C, 2_D, 3_A, 3_B, 3_C, 3_D. In this order,
macro-tasks in the same DLG are not executed consecutively.

However, the earliest executable condition shown in Fig. 2(b) means that 
macro-task 3_B, for example, can be executed immediately after macro-task 2_B 
because macro-task 3_B depends on only macro-task 2_B. 

In the proposed cache optimization scheme, the task scheduler for coarse grain
tasks assigns macro-tasks inside a DLG to the same processor as consecutively
as possible [14], in addition to using “critical path” priority. Fig. 3 shows a
schedule when the proposed cache optimization is applied to the macro-task
graph in Fig. 2(b) for a single processor. As shown in Fig. 3, macro-tasks 2_B,
3_B and 7_B in DLG_B and macro-tasks 2_C, 3_C and 7_C in DLG_C are executed
consecutively to use the cache effectively.
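A minimal single-processor sketch of the "stay inside the current DLG" rule is
given below. It is not the OSCAR scheduler: critical path priority is omitted,
the function name is made up for illustration, and the dependence of 7_X on 3_X
is a hypothetical edge (the text only states that 3_B depends on 2_B).

```python
def schedule_single_processor(tasks, deps, dlg_of):
    """Greedy order that prefers a ready task from the DLG executed last.

    tasks  : task names in original (node number) order
    deps   : task -> set of tasks that must finish first
    dlg_of : task -> DLG label
    """
    done, order, last_dlg = set(), [], None
    while len(order) < len(tasks):
        # A task is ready when all of its predecessors have finished.
        ready = [t for t in tasks if t not in done and deps.get(t, set()) <= done]
        same_dlg = [t for t in ready if dlg_of[t] == last_dlg]
        pick = (same_dlg or ready)[0]   # fall back to node-number order
        order.append(pick)
        done.add(pick)
        last_dlg = dlg_of[pick]
    return order


parts = "ABCD"
tasks = [f"{loop}_{p}" for loop in (2, 3, 7) for p in parts]
deps = {f"3_{p}": {f"2_{p}"} for p in parts}
deps.update({f"7_{p}": {f"3_{p}"} for p in parts})  # hypothetical edges
dlg_of = {t: "DLG_" + t[-1] for t in tasks}
order = schedule_single_processor(tasks, deps, dlg_of)
```

Under these assumptions the three macro-tasks of each DLG run back to back,
instead of the node-number order 2_A, 2_B, 2_C, 2_D, 3_A, and so on.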

4 Reduction of Cache Conflict Misses 

This section describes the data layout transformation using padding to reduce 
conflict misses among macro-tasks in a DLG. 



Fig. 3. Example of Scheduling Result on Single Processor

4.1 Conflict Misses in a DLG

In data localization, loops accessing the same shared variables larger than the
cache size are divided into smaller loops, or macro-tasks. Furthermore,
macro-tasks in the same DLG are executed consecutively on the same processor.
This enables the shared data to be reused before it is evicted from the cache.
However, if data accessed by macro-tasks in a DLG map to the same cache line,
data may be removed from the cache by a line conflict miss even though the data
size accessed in a DLG is not larger than the cache size.

Conflict misses in a DLG are reduced by a data layout transformation using
inter-array padding. In this section, SPEC CFP95 swim is used as an example for
the proposed padding scheme. Swim has 13 single precision 513x513 arrays, each
about 1MB in size. Fig. 4 shows the data layout image on the cache where the 13
arrays are allocated to a 4MB direct map cache. In this figure, boxes framed by
thick lines show arrays, and the horizontal direction represents the 4MB cache
space. Arrays that are vertically aligned in the figure are allocated to the
same cache lines and cause line conflict misses. For example, arrays U, VNEW,
POLD and H are allocated to the same part of the cache.
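The overlap can be checked with a few lines of arithmetic. The sketch below
assumes 4-byte elements and that the 13 arrays are laid out back to back
starting at a cache-aligned address; the actual layout is decided by the native
compiler, so this is an illustration only.

```python
CACHE_BYTES = 4 * 1024 * 1024      # 4MB direct map cache
ARRAY_BYTES = 513 * 513 * 4        # one single precision 513x513 array


def cache_offsets(n_arrays, array_bytes=ARRAY_BYTES, cache_bytes=CACHE_BYTES):
    """Offset in the cache image at which each array starts."""
    return [(k * array_bytes) % cache_bytes for k in range(n_arrays)]


offsets = cache_offsets(13)
part_size = ARRAY_BYTES // 4       # partial array when loops are divided by 4
# Arrays whose start falls inside the first partial array's cache range.
near_start = [k for k, off in enumerate(offsets) if off < part_size]
```

Under these assumptions the arrays at positions 0, 4, 8 and 12 start within
about 48KB of each other in the cache image, which is consistent with four
arrays sharing the same part of the cache.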

Dotted lines in the figure show the partial arrays accessed by the loops
divided by the LAD when each loop is divided into 4 smaller loops. The gray
part of each array shows a partial array accessed by the divided loops in a
DLG. As shown in the figure, conflict misses may occur among the partial arrays
accessed in a DLG, that is, among those that are vertically aligned.

These conflict misses interfere with data reuse among the consecutively
executed macro-tasks. A data layout transformation by padding is therefore
required to reduce conflict misses in a DLG and to optimize cache use among
loops or macro-tasks. The rest of this section describes the padding scheme.

4.2 Inter-Array Padding 

This section describes an inter-array padding procedure that changes array
declaration sizes.

Step 1: Select Target Arrays. Since the OSCAR compiler, on which the proposed
scheme is implemented, generates parallelized OpenMP FORTRAN, the actual data
layout is determined by the machine's native compiler, which is used as the
back end of the OSCAR compiler. Therefore, in the current implementation, the
OSCAR compiler chooses arrays of the same size as the targets of the proposed
padding and changes the declaration size of the target arrays to realize
inter-array padding.

Arrays in a FORTRAN “common block” are also chosen as targets of inter-array
padding if the common block has the same shape in all program modules, because
changing the declaration size of such arrays does not break the program
semantics. Padding for arrays in common blocks that have different shapes is
described in Section 4.3.

Step 2: Generate Data Layout Image on Cache. Next, the compiler calculates the
addresses of the selected arrays and generates a data layout image on the cache
as shown in Fig. 4. In this step, because all target arrays have the same size,
the compiler can determine the data layout image regardless of the actual data
layout determined by the native compiler.

Step 3: Calculate Minimum Division Number. The compiler calculates the minimum
division number (div_num) that makes the data size accessed in a DLG smaller
than the cache size, by dividing the total size of the target arrays by the
cache size and rounding up. In the example in Fig. 4, the total array size is
13MB and the cache size is 4MB, so div_num is ceil(13/4) = 4.

Step 4: Calculate Maximum DLG Access Size. The maximum data size accessed in a
DLG (part_size, the gray range in Fig. 4) is calculated by dividing the array
size by div_num. If partial arrays of part_size overlap in the data layout
image on the cache, conflicts may occur among arrays accessed in a DLG. If
there is no overlap, padding is not applied.

Step 5: Calculate Padding Size. To remove the conflict, the distance on the
cache between the base address of the first array (array U in Fig. 4) and the
base address of the first array beyond the cache size (array VNEW) should be
part_size. The padding size needed to remove the conflict between U and VNEW is
cache_size + part_size - base_address, where base_address is the base address
of VNEW. Pads of the same size are inserted similarly to remove all conflicts,
as shown in Fig. 5(a).

Step 6: Change Array Size. In the proposed scheme, the pads inserted between
certain arrays as shown in Fig. 5(a) are distributed over all arrays so that
the data layout does not depend on the specific order of the arrays. In
practice, the rightmost dimension of each array is enlarged to increase the
array size by padding size / narrays, where narrays is the number of arrays in
the range from the beginning to cache_size + part_size (4 in this example).
Fig. 5(b) shows the data layout image on the cache after the proposed padding
by changing the array size.
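Steps 3 to 6 can be traced numerically for swim. The sketch below is an
illustration under stated assumptions, not the compiler's code: arrays are
taken to be 4-byte 513x513 arrays laid out back to back from address 0, the
overlap test of Step 4 is skipped, and the per-array pad is rounded up to whole
slices of the rightmost dimension. The function name is hypothetical.

```python
def inter_array_padding(n_arrays, dims, elem_bytes, cache_bytes):
    """Return (div_num, part_size, pad_total, new rightmost dimension)."""
    array_bytes = elem_bytes
    for d in dims:
        array_bytes *= d
    total = n_arrays * array_bytes

    div_num = -(-total // cache_bytes)          # Step 3: ceiling division
    part_size = array_bytes // div_num          # Step 4

    # Step 5: the first array whose base address is at or beyond one cache size.
    narrays = -(-cache_bytes // array_bytes)    # arrays placed before it
    base_address = narrays * array_bytes
    pad_total = cache_bytes + part_size - base_address

    # Step 6: spread the pad over narrays arrays via the rightmost dimension.
    slice_bytes = array_bytes // dims[-1]       # bytes added per unit of that dimension
    extra = -(-pad_total // (narrays * slice_bytes))
    return div_num, part_size, pad_total, dims[-1] + max(extra, 0)


swim = inter_array_padding(13, (513, 513), 4, 4 * 1024 * 1024)
```

Under these assumptions div_num is 4, part_size is 263169 bytes, the pad
between U and VNEW is 246769 bytes, and the rightmost dimension grows from 513
to 544, which agrees with the 513x544 extent reported for swim in Section 5.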

4.3 Padding for Common Block 

Some program modules may declare different array sizes for the same common
block. Because padding among arrays in such a common block may change the
program semantics, it is difficult to apply inter-array padding to such arrays.
Therefore, the compiler merges such common blocks into a single large common
block and inserts pads between the original common blocks to maintain program
semantics and reduce conflict misses among arrays in the common blocks.

Fig. 5. Inter-Array Padding for Swim

4.4 Set Associative Cache 

In the current implementation, the proposed padding targets set associative
caches with an LRU replacement policy. A set associative cache is treated as a
direct map cache of the same size. If padding removes conflicts on the direct
map cache, the number of overlaps on an n-way cache is smaller than n, because
the data layout image on an n-way set associative cache is the same as that on
a direct map cache of 1/n the size. Therefore, there is no conflict on the
n-way cache because each cache set of an n-way cache can hold n lines.
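The folding argument can be illustrated with a toy cache. The sizes below are
arbitrary assumptions chosen to keep the example small; the point is only that
lines which are conflict-free in a direct map cache of a given size land at
most n per set in an n-way cache of the same size.

```python
from collections import Counter


def set_index(addr, cache_bytes, line_bytes, ways):
    """Set selected by an address; ways=1 models a direct map cache."""
    n_sets = cache_bytes // (line_bytes * ways)
    return (addr // line_bytes) % n_sets


CACHE, LINE, WAYS = 4096, 64, 4          # hypothetical toy cache
# One cache size of contiguous data: conflict-free when direct mapped.
addrs = [k * LINE for k in range(CACHE // LINE)]

direct = Counter(set_index(a, CACHE, LINE, 1) for a in addrs)
four_way = Counter(set_index(a, CACHE, LINE, WAYS) for a in addrs)
```

Here every direct-mapped line index is used once, and every set of the 4-way
cache receives exactly four lines, which it can hold without eviction.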

4.5 Page Placement Policy of Operating System 

A data layout transformation by a compiler is made on virtual addresses.
Therefore, on a physically-indexed cache, its effect depends on the page
placement policy that the operating system uses to map virtual addresses to
physical addresses.

Simple page coloring maps sequential virtual pages to sequential physical
pages. Therefore, a page conflicts with the page located exactly one cache size
away from it. Data layout transformation by a compiler is effective under this
policy because contiguity in the virtual address space is preserved in the
physical address space.

In bin hopping, sequential physical pages are assigned to virtual pages in the
order of page faults, irrespective of their virtual addresses. Contiguity in
the virtual address space is not preserved in the physical address space under
this policy. Therefore, it is difficult for a compiler to apply data layout
transformations effectively beyond the page size in the virtual address space.
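The difference between the two policies can be seen in a toy model. The sketch
below is a simplification: the 8KB page size, the identity mapping used for
page coloring and the function names are all assumptions made for illustration,
not a description of Solaris or AIX internals.

```python
PAGE_BYTES = 8 * 1024                    # assumed page size
CACHE_BYTES = 4 * 1024 * 1024
COLORS = CACHE_BYTES // PAGE_BYTES       # pages per cache image


def page_coloring(virtual_pages):
    """Model: sequential virtual pages map to sequential physical pages."""
    return {v: v for v in virtual_pages}


def bin_hopping(fault_order):
    """Model: physical pages are handed out in page fault order."""
    mapping = {}
    for v in fault_order:
        if v not in mapping:
            mapping[v] = len(mapping)
    return mapping


def color(physical_page):
    """Position of a physical page in the cache image."""
    return physical_page % COLORS


touched = [0, COLORS]                    # two virtual pages one cache size apart
pc = page_coloring(touched)
bh = bin_hopping(touched)
```

In this model the two pages share a cache position under page coloring, so a
layout computed on virtual addresses predicts the conflict, while under bin
hopping they land on adjacent physical pages and the virtual-address picture no
longer applies.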

5 Performance Evaluation 

This section describes the performance evaluation of the proposed scheme on a
Sun Ultra 80 and an IBM RS/6000 44p-270. The Ultra 80 has four 450MHz
UltraSPARC-II processors, each with a 4MB direct map L2 cache, and the RS/6000
has four 375MHz Power3 processors with 4MB 4-way set associative L2 cache
(LRU). Both caches are physically indexed. Solaris 8 on the Ultra 80 and AIX
4.3 on the RS/6000 support page coloring and bin hopping.

In the evaluation, sequential FORTRAN programs are translated into parallelized
OpenMP FORTRAN programs using the OSCAR compiler, on which the proposed scheme
has been implemented. Three kinds of compilation are compared: OSCAR with the
proposed padding, OSCAR without the padding, and automatic parallelization by
the machine's native compiler.

SPEC CFP95 tomcatv, swim, hydro2d and turb3d are used in this evaluation. The
original SPEC source code is used by both OSCAR and the native compiler for
tomcatv, swim and hydro2d. However, turb3d is preprocessed by the APC
compiler [6] in order to parallelize some loops containing subroutine calls,
because neither OSCAR nor the native compilers can currently parallelize such
loops.

Since the data size of the programs used in this evaluation is about ten MB,
the target of the proposed padding with data localization in this evaluation is
the L2 cache, which has a larger miss penalty and a larger impact on
performance than the L1 cache of 32KB or 64KB. In this evaluation, the number
of loops generated from a loop by loop division is the same as the number of
processors. Therefore, performance improvement is obtained mainly by the
proposed padding.

The proposed inter-array padding extends the 513x513 2-dimensional arrays to
513x573 for tomcatv, 513x513 to 513x544 for swim, and 66x64x64 to 66x64x71 for
turb3d. The padding for common blocks is applied to hydro2d: four common
blocks, VAR1, VAR2, VARH and SCRA, are merged into one common block, and a
dummy array of 318696 bytes is inserted between VAR2 and VARH.

5.1 Performance on Sun Ultra 80 

Solaris 8 supports Hashed VA, V.addr=P.addr and bin hopping as page placement
policies. The V.addr=P.addr method preserves the contiguity of virtual
addresses in physical addresses. Hashed VA is similar to V.addr=P.addr, but it
inserts a small gap every L2 cache size (4MB) to avoid conflict misses between
two addresses whose distance is exactly the L2 cache size. The default policy
of Solaris 8 is Hashed VA.

Speedups for 4PEs against sequential execution by the Sun Forte 6 update 2
compiler on the Sun Ultra 80 are shown in Fig. 6. Numbers above the bars in the
figure show execution times. In addition, the number of cache misses measured
by the CPU performance counter of the UltraSPARC-II is shown in Fig. 7.

Speedups on Hashed VA by the automatic parallelization of Forte for tomcatv,
swim and hydro2d are only 1.2, 1.7 and 1.8 times against sequential execution,
respectively, as shown in Fig. 6(a). Also, speedups by OSCAR without padding
are 1.4, 1.7 and 2.3 times, since conflict misses limit the scalability. For
example, the number of cache misses of swim by Forte automatic parallelization
is 300 million and that of OSCAR without padding is also 300 million, as shown
in Fig. 7(a). These are not much reduced compared with that of the sequential
execution (350 million) in spite of the quadrupled cache size on 4PEs. On the
other hand, turb3d has two kinds of loops. The first access pattern is
sequential and it causes conflict misses as shown in Section 4. However,
because the second access pattern is interleaved, its cache performance is
better than that of the other three programs.

Fig. 6. Speedups on Sun Ultra 80

Fig. 7. L2 Cache Misses on Sun Ultra 80

Since the Ultra 80 used in this evaluation has a single memory bank, memory
accesses are serialized and become the bottleneck of scalability. Therefore,
reducing conflict misses to improve the L2 cache performance is important.
Speedups by OSCAR with padding on Hashed VA are 6.3 times for tomcatv, 9.4 for
swim, 4.6 for hydro2d and 3.4 for turb3d on 4PEs against the sequential
execution. Also, padding improves the performance of OSCAR without padding by
4.7, 5.5, 2.0 and 1.2 times, respectively. Padding decreases the number of
cache misses to 3.5% of that of OSCAR without padding for tomcatv, 4.2% for
swim, 25% for hydro2d and 61% for turb3d, as shown in Fig. 7.

Speedups by OSCAR without padding against sequential execution on bin hopping
are 2.4 times for tomcatv, 3.0 for swim, 3.2 for hydro2d and 3.2 for turb3d, as
shown in Fig. 6(b). These are 1.8, 1.7, 1.4 and 1.1 times better than OSCAR
without padding on Hashed VA. The reason is that conflict misses assumed on
virtual addresses do not appear on physical addresses. Speedups by OSCAR with
padding on bin hopping are 2.5, 3.0, 3.2 and 3.0 times for the respective
programs, which differs from OSCAR without padding by only a few percent.

Fig. 8. Speedups on RS/6000 44p-270



In this evaluation, the best performance on the Ultra 80 is given by OSCAR with
padding on Hashed VA. Its execution times are 19 seconds for tomcatv, 11
seconds for swim, 31 seconds for hydro2d and 56 seconds for turb3d, whereas the
minimum execution times on bin hopping are 47 seconds, 35 seconds, 44 seconds
and 60 seconds, respectively.

5.2 IBM RS/6000 44p-270 

Fig. 8 shows speedups for 4PEs against sequential execution on the IBM RS/6000
44p-270 with a 4-way set associative L2 cache (LRU). The default page placement
policy of AIX 4.3 is bin hopping, and page coloring is also supported.

As shown in Fig. 8(a), speedups by OSCAR with padding against sequential
execution on bin hopping are 2.6 times for tomcatv, 5.0 for swim, 4.6 for
hydro2d and 3.2 for turb3d. They are 27%, 4.6%, 2.3% and 0.2% better than OSCAR
without padding.

Speedups by OSCAR without padding on page coloring are 1.6, 1.9, 2.9 and 3.0
times against sequential execution, lower than on bin hopping. Compared with
page coloring, bin hopping shows 1.2 times better performance for XLF automatic
parallelization and 1.5 times better performance for OSCAR without padding.
However, OSCAR with padding gave 3.0 times speedup for tomcatv, 7.8 for swim,
4.3 for hydro2d and 3.2 for turb3d against sequential execution on page
coloring. Padding improves the performance of OSCAR without padding by 2.0,
4.1, 1.5 and 1.1 times for the respective programs.

Execution times by OSCAR with padding on page coloring are 23 seconds for
tomcatv, 8 seconds for swim, 17 seconds for hydro2d and 25 seconds for turb3d,
and the minimum execution times on bin hopping are 27 seconds, 12 seconds, 16
seconds and 25 seconds, respectively. OSCAR with padding on page coloring gave
the best performance on the RS/6000 44p-270.

6 Conclusions

This paper has described cache optimization with data localization for coarse
grain task parallel processing on SMP machines. In the proposed scheme, loops
are divided into smaller loops to fit the cache, and loops accessing shared
data are executed on the same processor as consecutively as possible to improve
temporal locality across different loops. Moreover, cache line conflicts among
loops are reduced by inter-array padding.

The proposed scheme is implemented in the OSCAR compiler, the core compiler of
the APC compiler developed in the Japan METI Advanced Parallelizing Compiler
project, a part of the Millennium Project IT21 [6], and it was evaluated on two
commercial SMP workstations having different cache configurations with popular
operating system page placement policies. In the evaluation on the Sun Ultra 80
(4PE), which has 4MB direct map L2 caches, the proposed padding scheme gave 5.9
times speedup against sequential execution on average over 4 SPEC CFP95
programs, tomcatv, swim, hydro2d and turb3d, on the default page placement
policy called Hashed VA. OSCAR with padding on page coloring also gave 4.6
times speedup against sequential execution on the RS/6000 44p-270 (4PE), which
has a 4MB 4-way set associative L2 cache. The evaluation on two multiprocessors
shows that OSCAR with padding on page coloring gave the best performance on
both machines.

Abstract. Recent research results show that conventional hardware-only cache
solutions result in unsatisfactory cache utilization for both regular and
irregular applications. To overcome this problem, a number of architectures
introduce instruction hints to assist cache replacement. For example, the Intel
Itanium architecture augments memory accessing instructions with cache hints to
distinguish data that will be referenced in the near future from the rest. With
the availability of such methods, the performance of the underlying cache
architecture critically depends on the ability of the compiler to generate code
with appropriate cache hints. In this paper we formulate this problem - giving
cache hints to memory instructions such that cache miss rate is minimized - as
a 0/1 knapsack problem, which can be efficiently solved using a dynamic
programming algorithm. The proposed approach has been implemented in our
compiler testbed and evaluated on a set of scientific computing benchmarks.
Initial results show that our approach is effective in reducing the cache miss
rate and improving program performance.



1 Introduction 

Over the last few decades, as processor performance kept undergoing substantial
progress, the gap between processor and memory speeds has been widening
steadily. This problem, known as the “memory wall” problem, exists in both
general-purpose high-performance computers [13] and embedded systems [17]. To
bridge this performance gap, caches were introduced, which have ameliorated the
“memory wall” problem to some extent. However, a conventional cache is
typically designed in a hardware-only fashion, where data management including
cache line replacement is decided purely by hardware. A consequence of this
design approach is that the cache can make poor decisions in choosing data to
be replaced, which may lead to poor cache performance. The widely used LRU
(least recently used) cache replacement algorithm makes replacement decisions
based on past reference behavior. This can cause data with good reuse to yield
cache space to data that comes in later but has poor reuse. Research results
reveal that a considerable fraction of cache lines are held by data that will
not be reused again before it is displaced from the cache. This is true for
both irregular [4] and regular applications [15]. This phenomenon, called cache
pollution, severely degrades cache performance.

There are a number of efforts in architecture design to address this problem,
and the cache hint mechanism implemented in the Intel Itanium processor [9] is
one of them. The memory accessing instructions of Itanium can be accompanied by
an nt (which stands for non-temporal) cache hint. In response, Itanium-2
implemented a modified LRU replacement algorithm honoring the nt cache
hint [9]. In the Itanium-2 processor, the execution of memory accessing
instructions with the nt cache hint differs from that of a normal memory
instruction in the following way. For a set-associative cache, when a normal
memory instruction is executed, a cache line is allocated for the accessed
data, and the just allocated cache line is given the highest rank in the set
(to indicate that it is the most recently used). Thus it becomes the last to be
replaced among all cache lines in the particular set. In contrast, the
execution of a memory instruction with the nt cache hint does not change the
rank of the touched cache line. In this modified LRU replacement mechanism,
data accessed by instructions with the nt hint is more likely to be evicted on
a subsequent cache miss. By relying on the compiler to give the nt hint to the
instructions accessing data without temporal reuse, this architecture
effectively prevents cache pollution and thus has the potential to achieve
better cache locality. On this architecture, a good compiler algorithm to
generate cache hints is essential, which is the focus of this paper.
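The effect of the modified replacement rule can be seen in a small model of one
cache set. This is a sketch of the behavior described above, not of the Itanium
hardware: the class name is made up, and giving a newly allocated nt line the
lowest rank is an assumption about what "does not change the rank" means on a
miss.

```python
class NtLruSet:
    """One cache set; lines[0] is the next victim, lines[-1] the most recent."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = []

    def access(self, tag, nt=False):
        """Access a line and return True on a hit."""
        if tag in self.lines:
            if not nt:                      # normal hit: promote to highest rank
                self.lines.remove(tag)
                self.lines.append(tag)
            return True                     # nt hit: rank left unchanged
        if len(self.lines) == self.ways:
            self.lines.pop(0)               # evict the lowest-ranked line
        if nt:
            self.lines.insert(0, tag)       # assumption: nt fill stays lowest rank
        else:
            self.lines.append(tag)
        return False


def reuse_survives(stream_with_nt):
    """Touch 'a', stream three single-use lines, then touch 'a' again."""
    s = NtLruSet(ways=2)
    s.access("a")
    for tag in ("x1", "x2", "x3"):
        s.access(tag, nt=stream_with_nt)
    return s.access("a")
```

In this model, the reused line survives the streaming accesses only when they
carry the nt hint; with plain LRU the stream pushes it out of the 2-way set.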

Intuitively, two kinds of memory instructions should be given the nt hint: (i)
those whose referenced data does not exhibit temporal reuse, and (ii) those
whose referenced data does exhibit temporal reuse that cannot be realized under
the particular cache configuration. It may seem that the problem is simple for
regular applications, and that existing techniques for analyzing data
reuse [20] and estimating cache misses [11, 21, 12] suffice to solve it. This
plausible statement, however, is not true, because a fundamental technique used
in cache miss estimation, footprint analysis, is based on the assumption that
all accessed data compete for cache space equally. In our target architecture,
however, memory instructions are not homogeneous: those with cache hints have
much less demand for cache space. This makes an approach derived from
traditional footprint analysis very conservative. In summary, the following
cyclic dependence exists: cache hint assignment requires accurate cache miss
estimation, while accurate cache miss estimation is only possible when cache
hint assignment is finalized.

In this paper, we develop a simple yet effective formulation to address the
above problem. Our formulation is based on the observed relationship between
cache miss rate and cache-residency of the reference window [10]. This is used
to formulate the problem as a 0/1 knapsack problem [8]. For the case where all
considered memory referencing instructions are enclosed by a perfect loop nest,
the formulated problem falls into a special category of knapsack problem that
can be solved in polynomial time. For the case where loops are imperfectly
nested, this is a general 0/1 knapsack problem, which is known to be
NP-complete [8]. In this case, good heuristic algorithms exist to achieve
near-optimal results [5]. However, since the number of references in a loop
nest is typically small, even obtaining the optimal result using a dynamic
programming algorithm [8, 14] is quite inexpensive.

We have evaluated the benefit of our approach in reducing cache misses on a set
of loop kernels and a full SPEC benchmark program by simulating their execution
using the SimpleScalar simulator [3]. Initial experimental results show that
our approach reduces the number of data cache misses by up to 57.1% and reduces
execution time by up to 27%.

The rest of the paper is organized as follows. Section 2 briefly reviews the
basic concepts of data reuse and reference window. Section 3 illustrates,
through an example, the relationship between reference window and cache miss
rate, which sets up the rationale for our problem formulation. The heart of
this paper, an elegant knapsack problem formulation, is derived in Section 4.
Our implementation and experimental results are then presented in Section 5.
Section 6 discusses related work. Section 7 concludes the paper and envisions
possible future research directions.

2 Preliminaries 

We review some basic concepts on data reuse and reference window that will be 
used in the rest of this paper. 

For an affine array reference in a loop nest of depth n, the subscripts can be
represented as $H \cdot i + c$ (where $H$ is the access matrix, $i$ is the
iteration vector and $c$ is the offset vector). If two different executions of
an array reference at iteration points $i_1$ and $i_2$ access the same array
element, it must be true that $H \cdot (i_2 - i_1) = 0$. Therefore, if the
equation $H \cdot i = 0$ has a nonzero solution, the array reference with
subscripts $H \cdot i + c$ exhibits self-temporal reuse, and the solution to
$H \cdot i = 0$ constitutes the self-temporal reuse vector. Two references to
the same array with the same access matrix but different offset vectors, say
reference $H \cdot i + c_1$ and reference $H \cdot i + c_2$, may access the
same data only if the equation $H \cdot (i_1 - i_2) = c_2 - c_1$ can be
satisfied. Thus group-temporal reuse exists when $H \cdot i = c_2 - c_1$ has a
solution, and the solution constitutes the group-temporal reuse vector.

A uniformly generated reference set (UGS) is a set of references to the same
array that have the same access matrix and group data reuse within the
set [10]. By defining uniformly generated reference sets and partitioning all
array references into UGSs, we can study data reuse on a per-UGS basis.

Gannon et al.’s work introduced the term reference window, which is defined as
the set of array elements that were accessed by the source reference of a reuse
pair in the past and will be accessed in the future by the sink reference [10].
Consider the Fortran program shown in Figure 1 as an example. This is a small
kernel from the SPEC92 benchmark 093.nasa7. Reference windows associated with
all loop-carried reuse pairs are listed in Figure 2.

      DO 110 J = 1, M, 4
        DO 110 K = 1, N
          DO 110 I = 1, L
            C(I,K) = C(I,K) + A(I,J) * B(J,K)
     $             + A(I,J+1) * B(J+1,K) + A(I,J+2) * B(J+2,K)
     $             + A(I,J+3) * B(J+3,K)
  110 CONTINUE

Fig. 1. The MXM loop kernel from SPEC92. Values of M, N and L are 128, 64 and
256 respectively. A, B and C are two dimensional arrays with 8-byte double
precision floating-point array elements

Ref_Win(C(I,K) → C(I,K))     = {C(1,1), C(1,2), ..., C(1,N), ...,
                                C(L,1), C(L,2), ..., C(L,N)}
Ref_Win(A(I,J) → A(I,J))     = {A(1,J), A(2,J), ..., A(L,J)}
Ref_Win(A(I,J+1) → A(I,J+1)) = {A(1,J+1), A(2,J+1), ..., A(L,J+1)}
Ref_Win(A(I,J+2) → A(I,J+2)) = {A(1,J+2), A(2,J+2), ..., A(L,J+2)}
Ref_Win(A(I,J+3) → A(I,J+3)) = {A(1,J+3), A(2,J+3), ..., A(L,J+3)}
Ref_Win(B(J,K) → B(J,K))     = {B(J,K)}
Ref_Win(B(J+1,K) → B(J+1,K)) = {B(J+1,K)}
Ref_Win(B(J+2,K) → B(J+2,K)) = {B(J+2,K)}
Ref_Win(B(J+3,K) → B(J+3,K)) = {B(J+3,K)}

Fig. 2. Reference windows for reuse pairs

Let us explain why the reference windows are as given in Figure 2. For
reference C(I,K) at iteration (j, k, i), where j > 1 and j < M, the entire
array has been traversed by previous iterations, and all the array elements
will be accessed again before the loop execution advances to (j + 1, k, i).
Therefore the reference window is the entire array. For the self-reuse of array
reference A(I,J) at iteration (j, k, i), where k > 1 and k < N, all elements in
the first dimension will be referenced in the future. The other reference
windows given above can be derived similarly.

A careful study reveals that the size of a reference window is determined by
its reuse vector. By solving the reuse equation $H \cdot i = 0$ for each
reference, we get the reuse vectors for references C(I,K), A(I,J) and B(J,K) as
(1, 0, 0), (0, 1, 0) and (0, 0, 1), respectively. By the definition of reuse
vector, we know that C(I,K) accesses the same element at iterations (j, k, i)
and (j + 1, k, i); however, these two iterations are far apart, and thus the
number of different array elements accessed in between (i.e., the reference
window) is large. In contrast, the reuse of B(J,K) happens at the innermost
loop, so its reference window is much smaller. Gannon et al. gave a formula to
compute the size of a reference window based on the reuse vector; we refer
interested readers to [10] for more details.

Fig. 3. Cache occupancy and miss rate of array reference A(I,J) in the program
shown in Figure 1
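The reuse vectors quoted in Section 2 for the MXM kernel can be checked
directly against the access matrices. The sketch below assumes the iteration
vector is ordered (j, k, i), outermost loop first, and writes each access
matrix by hand from the subscripts; the helper names are illustrative only.

```python
def matvec(H, v):
    """Multiply an access matrix (list of rows) by an iteration vector."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]


def is_reuse_vector(H, r):
    """A self-temporal reuse vector is a nonzero solution of H . r = 0."""
    return any(r) and all(x == 0 for x in matvec(H, r))


# Rows give each subscript as a combination of (j, k, i).
ACCESS = {
    "C(I,K)": [[0, 0, 1], [0, 1, 0]],
    "A(I,J)": [[0, 0, 1], [1, 0, 0]],
    "B(J,K)": [[1, 0, 0], [0, 1, 0]],
}
REUSE = {"C(I,K)": (1, 0, 0), "A(I,J)": (0, 1, 0), "B(J,K)": (0, 0, 1)}

checks = {ref: is_reuse_vector(ACCESS[ref], REUSE[ref]) for ref in ACCESS}
```

Each quoted vector points along the one loop whose index does not appear in the
reference's subscripts, which is why C(I,K) is reused across the outermost loop
and B(J,K) across the innermost one.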



3 A Case Study 

In this section, using the matrix multiply program shown in Figure 1, we
illustrate the relationship between reference window and cache miss rate.

First let us analyze the data reuse (footnote 1) for this program. We start
with the array reference A(I,J). The data accessed by this reference at
iteration (j, k, i) is A(i,j), and it will be accessed again by the same array
reference at iteration (j, k + 1, i). Intervening data accesses by all array
references during this interval (from (j, k, i) to (j, k + 1, i)) do not
interfere with A(i,j). This kind of reuse is named self-reuse [20]. Following
the reuse analysis method given by Wolf and Lam [20], we can easily derive that
the data reuse of all other array references to A is also self-reuse (there is
no reuse between references A(I,J+1) and A(I,J) since the stride of loop J
is 4).

Since there is no data reuse relation between any two references to A, we can
study each of the references A(I,J), A(I,J+1), A(I,J+2) and A(I,J+3) in
isolation. Without loss of generality, we choose the reference A(I,J) and
profile its cache behavior. Before giving the profiling result, we define the
term cache occupancy to refer to the number of cache lines occupied by a
particular array reference. We traced the cache occupancy and the cache miss
rate for the reference A(I,J) on a 256-set, 4-way associative cache with a
cache line size of 8 bytes. Both are shown in Figure 3, where each value is
obtained by averaging over the last 20 clock cycles. In the figure, the cache
occupancy of the reference varies slightly from 255.5 to 256, while the cache
miss rate varies widely, from 0% to 100%.



1 Data reuse is a term different from cache locality; data reuse leads to cache
locality only when the reuse can be realized by the particular cache
configuration.

We observe that the cache miss rate is tightly coupled with the cache
occupancy: it falls as the cache occupancy rises. When the average cache
occupancy of the reference is 256 for the last 20 cycles, the cache miss rate
is zero during this period. When the cache occupancy drops to 255.5 - 255.6
(due to competition with other array references), the cache miss rate rises to
100%. This is somewhat surprising, at least initially, as the decrease in the
cache occupancy is only marginal (from 256 to 255.5).

Let us go back to the source program and analyze why this happens. As discussed
before, the array element accessed by reference A(I,J) at iteration (j, k, i)
will be accessed again by the same array reference at iteration (j, k + 1, i).
The number of distinct array elements accessed by A(I,J) in between (including
the two bounding iterations) is 256. These 256 array elements are the reference
window for the self-reuse vector of A(I,J) that we derived in Section 2. Hence
we conclude that if the cache holds all elements of the reference window for a
particular reuse pair, the data reuse is translated into cache locality at
run-time; otherwise, that reuse cannot be exploited by the cache. Based on this
observation, we formulate the problem of giving the nt cache hint in Section 4.
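The all-or-nothing behavior can be reproduced with a much simpler model than
the profiled cache. The sketch below is an assumption-laden illustration: it
uses a fully associative LRU cache with one element per line and a reference
that sweeps cyclically over a window of 256 elements, rather than the 256-set,
4-way cache and the full kernel used for Figure 3.

```python
from collections import OrderedDict


def steady_miss_rate(window, capacity, passes=4):
    """Miss rate of cyclic sweeps over `window` elements, after a cold first pass."""
    cache = OrderedDict()                  # keys ordered from least to most recent
    misses = accesses = 0
    for p in range(passes):
        for elem in range(window):
            hit = elem in cache
            if hit:
                cache.move_to_end(elem)    # refresh recency on a hit
            else:
                if len(cache) >= capacity:
                    cache.popitem(last=False)   # evict the least recently used
                cache[elem] = True
            if p > 0:                      # ignore compulsory misses of pass 0
                accesses += 1
                misses += not hit
    return misses / accesses


fits = steady_miss_rate(window=256, capacity=256)
one_short = steady_miss_rate(window=256, capacity=255)
```

In this model, losing a single line of capacity turns every reuse into a miss,
because LRU always evicts the element that will be needed soonest. This mirrors
the jump from 0% to 100% observed when the occupancy falls just below 256.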



4 Problem Formulation 

In this section we give a problem formulation for generating the nt hint for
memory instructions. We start with the case where all memory references have
self-reuse only and give the problem formulation in Section 4.1. The general
case that includes group-reuse is discussed in Section 4.2.

4.1 Problem Formulation for Self-Reuse: Case I

The particular problem that we address in this section is as follows: 



Problem 1. Given a cache size and a perfect loop nest whose loop body has m 
array references with no two references having data reuse between them, deter- 
mine the subset of references that should be given nt hint such that cache miss 
rate of executing this loop nest on the given cache is minimized. 

As demonstrated by the profiling result of the matrix multiply program (shown in
Figure 3), to realize a data reuse, the reference window of that data reuse must
be accommodated by the cache. In reality, cache size is limited, and the set of
reference windows that the cache can hold is bounded by its capacity. We associate
each array reference i with a binary variable $b_i$ that denotes whether it is
given the nt hint ($b_i = 0$) or not ($b_i = 1$); the variables $b_1, \ldots, b_m$
constitute all decision variables of the problem. The constraint imposed by cache
capacity can be formulated as:

$$\sum_{i=1}^{m} |\mathrm{Ref\_Win}(i)| \cdot b_i \le C \qquad (1)$$

where |Ref_Win(i)| refers to the size of the reference window of array
reference i, and C is the effective cache size [11, 18]. We use the effective
cache size instead of the full cache size in the capacity constraint since stride
access with a stride larger than 1 cannot exploit the full cache capacity, as
shown in Gao et al.’s work [11].

The capacity constraint ensures that for an array reference i whose
corresponding decision variable $b_i$ has the value 1, its reference window will be
fully accommodated by the cache. Hence its temporal reuse can be realized. Since our objective
is to minimize the cache miss rate, it is desirable to have as many array refer- 
ences as possible achieve temporal locality. And since all array references are 
enclosed by a perfect loop nest, their execution frequencies are the same. Thus 
our objective function is formulated as: 



$$\max \sum_{i=1}^{m} b_i \qquad (2)$$

This problem is composed of the constraint specified by Inequality 1 and the
objective function specified by Equation 2. It is, in essence, a 0/1 knapsack
problem [8]. For the problem formulation that we have given, the knapsack
problem falls into a special category where the candidate items have different
weights (the size of the reference window) but the same value (1). For this
special case, the knapsack problem can be solved using a greedy algorithm in
polynomial time. We give the details of such an algorithm in [22]. For more
complicated cases where the loops are imperfectly nested, the coefficients of
$b_i$ in the objective function will not be uniform, resulting in a more general
0/1 knapsack problem. For the general 0/1 knapsack problem, an optimal result can
be obtained by using a dynamic programming algorithm in O(mC) time [8, 14], where
m is the number of array references and C is the effective cache size. If the
time complexity of the dynamic programming approach is unaffordable, heuristic
algorithms also exist that obtain near-optimal results [5].
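
Both solution strategies can be sketched briefly. The code below is an illustration under stated assumptions rather than the algorithm of [22]: the function names, the reference names X, Y and Z, and the window sizes are hypothetical, and window sizes and capacity are taken to be integers in the same unit.

```python
def greedy_equal_value(windows, capacity):
    """Case I: every kept reference has value 1, so keep the smallest windows first.

    windows  -- dict mapping a reference name to its reference-window size
    capacity -- effective cache size C, in the same unit as the window sizes
    Returns the set of references with b_i = 1 (all others receive the nt hint).
    """
    kept, used = set(), 0
    for ref, size in sorted(windows.items(), key=lambda item: item[1]):
        if used + size > capacity:
            break                      # every remaining window is at least as large
        kept.add(ref)
        used += size
    return kept


def dp_knapsack(windows, values, capacity):
    """General 0/1 knapsack by dynamic programming in O(m * C) time."""
    # best[c] = (total value, kept set) achievable within capacity c
    best = [(0, frozenset())] * (capacity + 1)
    for ref, size in windows.items():
        # Descending c guarantees that each reference is used at most once.
        for c in range(capacity, size - 1, -1):
            value, kept = best[c - size]
            if value + values[ref] > best[c][0]:
                best[c] = (value + values[ref], kept | {ref})
    return best[capacity]


windows = {"X": 256, "Y": 64, "Z": 300}          # hypothetical window sizes
print(greedy_equal_value(windows, 400))
print(dp_knapsack(windows, {"X": 1, "Y": 1, "Z": 5}, 400))
```

With uniform values the greedy choice is optimal because replacing any kept window by a larger one can never make room for more references; with non-uniform values that argument fails and the dynamic program is needed.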

4.2 Problem Formulation for Group Reuse: Case II 

Now we extend our approach to the general case that group-reuse exists. The 
problem that we address in this section is: 

Problem 2. Given a cache size and a perfect loop nest whose loop body has m 
array references that have group data reuse, determine the subset of references 
that should be given nt hint such that cache miss rate of executing the loop nest 
is minimized. 

To address this problem, group reuse of these m array references should 
be figured out first. Then we can formulate this problem in a similar way as 
in Case I. Our approach to address this problem is therefore divided into the 
following three steps: 
1. Partition the array references into UGSs.

2. Represent the reuse within each UGS using a reuse graph and prune the
edges of the reuse graph to simplify the problem.

3. Form a 0/1 knapsack problem from the pruned reuse graph.

We illustrate these steps by using an example program: 

DO 10 T = 1, IT 
DO 10 I = 1,M 
DO 10 J = 1,N 

L(I,J) = (A(I,J-1) + A(I,J+1) + A(I-1,J) + A(I+1,J)) / 4 

10 CONTINUE 

Step 1. Partitioning: In the first step, we partition array references into UGSs 
such that group reuse exists only within each set. This step is the same as that 
documented in Wolf et al.’s paper [21] and Mowry’s dissertation [16]. For the
example program, the five array references are partitioned into two UGSs: 

Set 1 = { L(I,J) }

Set 2 = { A(I,J-1), A(I,J+1), A(I-1,J), A(I+1,J) } 
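
The partitioning step can be sketched as follows, assuming the usual definition of a uniformly generated set: references to the same array whose subscripts share the same linear part H and differ only in the constant offset c. The data layout (tuples for H and c) and the function name are illustrative assumptions.

```python
from collections import defaultdict

# Each reference is (array name, H, c) with subscripts = H * (T, I, J)^T + c.
# H is stored as a tuple of rows so that it can serve as a dictionary key.
H_IJ = ((0, 1, 0),   # first subscript  = I
        (0, 0, 1))   # second subscript = J

references = {
    "L(I,J)":   ("L", H_IJ, (0, 0)),
    "A(I,J-1)": ("A", H_IJ, (0, -1)),
    "A(I,J+1)": ("A", H_IJ, (0, 1)),
    "A(I-1,J)": ("A", H_IJ, (-1, 0)),
    "A(I+1,J)": ("A", H_IJ, (1, 0)),
}

def partition_into_ugs(refs):
    """Group references that access the same array with the same matrix H."""
    groups = defaultdict(list)
    for name, (array, H, _offset) in refs.items():
        groups[(array, H)].append(name)
    return list(groups.values())

for ugs in partition_into_ugs(references):
    print(ugs)
```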

Step 2. Pruning: The nice feature of the target loops of Problem 1 that we 
dealt with in Section 4.1 is that data reuse is within each single reference, thus 
the cost and benefit of realizing the reuse is clearly defined. The presence of 
group-reuse makes this feature disappear and we have to deal with the case that 
data accessed by one array reference is reused by several other array references. 
We represent group data reuse using a reuse graph (as shown in Figure 4), where 
each edge (solid or dashed) represents a possible reuse. The reuse graph can be 
simplified such that each reference has only one successor and one predecessor. 
In the following paragraph we discuss how to prune the reuse graph. In Figure 4, 
edges remaining after pruning are shown as solid edges and edges that can be 
pruned are shown as dashed edges. For legibility reasons, we did not show all 
pruned edges. However all solid edges that remain after pruning are shown. 

Consider the reuse between A(I+1,J) and A(I,J-1) as an example. Although
reuse testing by solving the reuse equation yields a reuse edge from
A(I+1,J) to A(I,J-1), a careful analysis reveals that this reuse does not
actually happen, because of the intervening access generated by A(I,J+1).
Consider the location A(i+1,j) accessed by references A(I+1,J) and A(I,J-1).
These accesses happen at iterations (i, j) and (i+1, j+1), respectively.
Before this reuse can be realized between these two references, a reuse by
reference A(I,J+1) happens at iteration (i+1, j-1). Hence the reuse edges
(A(I+1,J), A(I,J+1)) and (A(I,J+1), A(I,J-1)) together, transitively,
represent the reuse information between A(I+1,J) and A(I,J-1). Therefore the
edge (A(I+1,J), A(I,J-1)) can be pruned. In a similar way, all transitive
edges can be pruned from the reuse graph. By pruning the transitive edges, we
get a reuse graph in which each node has at most one successor and one
predecessor. This property of the pruned reuse graph facilitates our knapsack
problem formulation, since the cost and benefit of realizing each reuse can be
easily identified.
Fig. 4. Data reuse graph for Set 2 of the example program. The vector adjacent to
each reuse edge is the reuse vector



As seen, when multiple array references can reuse the data of a common
parent, the pruning step chooses the one that reuses the data at the earliest time.
Thus the rule for pruning is: for an array reference from which multiple
reuse edges emanate, keep the edge that has the minimum reuse vector and prune all
other edges. Reuse vectors are ordered in lexicographic order [1].
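
The pruning rule can be sketched for Set 2 of the example program. The sketch makes several simplifying assumptions that are not from the original text: reuse vectors are written as (dT, dI, dJ), loop bounds are ignored, self-reuse edges are left out, and the helper names are hypothetical.

```python
OFFSETS = {            # constant offsets (c_I, c_J) of the references in Set 2
    "A(I,J-1)": (0, -1),
    "A(I,J+1)": (0, 1),
    "A(I-1,J)": (-1, 0),
    "A(I+1,J)": (1, 0),
}

def reuse_vector(src, dst):
    """Smallest lexicographically positive (dT, dI, dJ) after which dst
    touches the element that src touches now (loop bounds ignored)."""
    d_i = OFFSETS[src][0] - OFFSETS[dst][0]
    d_j = OFFSETS[src][1] - OFFSETS[dst][1]
    # Reuse within the same T iteration needs (dI, dJ) to point forward in
    # time; otherwise the element is next reached in the following T iteration.
    d_t = 0 if (d_i, d_j) > (0, 0) else 1
    return (d_t, d_i, d_j)

def prune(refs):
    """For each reference keep only the outgoing edge with the minimum reuse vector."""
    kept = {}
    for src in refs:
        candidates = [(reuse_vector(src, dst), dst) for dst in refs if dst != src]
        kept[src] = min(candidates)       # Python tuples compare lexicographically
    return kept

for src, (vector, dst) in prune(OFFSETS).items():
    print(src, "->", dst, vector)
```

Under these assumptions every reference keeps exactly one successor, and the edge from A(I+1,J) to A(I,J-1) discussed above is discarded in favour of the edge to A(I,J+1).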

Step 3. Formulation: After pruning we proceed to the last step, viz.,
formulating the problem. The cost of realizing a temporal reuse is the size of the
reference window associated with the reuse. By realizing the reuse, the reference
reusing the data will get its data from cache instead of memory, thus saving a
memory reference for an iteration. In the pruned reuse graph, the reference window
sizes of the four reuse edges emanating from A(I+1,J), A(I,J+1), A(I,J-1) and
A(I-1,J) are N - 1, 2, N - 1 and (M - 2) · N respectively. L(I,J) has self-reuse
with a reference window size of M · N. The problem for the example program can
be formulated as:

Maximize:

$$b_{A(I+1,J)} + b_{A(I,J+1)} + b_{A(I,J-1)} + b_{A(I-1,J)} + b_{L(I,J)}$$

within the constraint:

$$(N-1) \cdot b_{A(I+1,J)} + 2 \cdot b_{A(I,J+1)} + (N-1) \cdot b_{A(I,J-1)} + (M-2) \cdot N \cdot b_{A(I-1,J)} + M \cdot N \cdot b_{L(I,J)} \le C$$
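
For concreteness, the example formulation can be instantiated and solved with the smallest-window-first rule. This is only a sketch: the values M = N = 100 and C = 512 (in array elements) are hypothetical, and the function names are illustrative.

```python
def build_items(M, N):
    """Window sizes from Step 3, keyed by the reference each reuse edge emanates from."""
    return {
        "A(I+1,J)": N - 1,
        "A(I,J+1)": 2,
        "A(I,J-1)": N - 1,
        "A(I-1,J)": (M - 2) * N,
        "L(I,J)":   M * N,
    }

def solve(items, capacity):
    """Equal-value 0/1 knapsack: set b = 1 for the smallest windows that fit."""
    b = {ref: 0 for ref in items}
    used = 0
    for ref, size in sorted(items.items(), key=lambda kv: kv[1]):
        if used + size > capacity:
            break
        b[ref] = 1
        used += size
    return b, used

b, used = solve(build_items(M=100, N=100), capacity=512)   # hypothetical sizes
for ref, value in b.items():
    print(f"b[{ref}] = {value}")
print("cache space used:", used)
```

With these numbers the three short-distance reuses fit in the cache, while the two reuses that span a whole sweep of the array do not, so their decision variables stay at 0.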



5 Experimental Results 
5.1 Experimental Platform 

We have implemented our approach in the MIPSpro compiler and evaluated
its performance by running SPEC benchmarks on the SimpleScalar simulator [3].
The MIPSpro compiler is the production-quality compiler developed by SGI for
MIPS processors. We have re-engineered the code generator of the MIPSpro
compiler to generate code for SimpleScalar. The MIPSpro compiler has a rich
set of optimizations to maximize program performance. We have enabled most
of them in our experiment. As a first step of our work, we did not enable loop
nest transformations in our experiment. Studying the interaction between our
technique and other locality-enhancing techniques like loop fusion, loop fission,
loop permutation and loop tiling is left for future work. However, optimizations
applied on loop bodies, like strength reduction, induction variable elimination
and cross-iteration common subexpression elimination, which do not change the
loop nest structure, are still invoked.

We implemented the algorithm for computing the reference window size given
in Gannon et al.’s paper [10], which is used in the 0/1 knapsack problem. We
have also implemented the knapsack problem formulation (i.e., generating the
constraints) and a greedy algorithm to get the optimal solution for it in the
MIPSpro compiler. A dynamic-programming algorithm for the general 0/1 knapsack
problem is interesting but was not required, since perfect loop nests dominate
in the benchmarks we evaluated. We did not consider scalar references for the nt
hint, as scalar variables only consume a small portion of cache space.

SimpleScalar uses the MIPS instruction set with a few minor differences. Each
instruction word in SimpleScalar is 64 bits long, of which the most significant
16 bits are not used at present. This 16-bit field is called the annotation field
in SimpleScalar, and we use it to carry the cache hint in our experiment. During
code generation, memory instructions are given the nt hint according to the
solution of the 0/1 knapsack problem. In response to this modification of the ISA,
we modified the simulation mechanism as well. We implemented the modified LRU
algorithm, which does not change the rank of the cache line for accesses with the
nt hint.
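
A minimal sketch of such a replacement policy is given below. It is not the SimpleScalar implementation: the class name and interface are hypothetical, and it additionally assumes that a line brought in by a miss with the nt hint is placed at the LRU position, which is one reading of "does not change the rank".

```python
class HintedLRUCache:
    """Set-associative cache whose nt accesses leave the LRU ranking untouched."""

    def __init__(self, size_bytes, ways, line_bytes=16, honor_hints=True):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (ways * line_bytes)
        self.sets = [[] for _ in range(self.num_sets)]  # per set: MRU first, LRU last
        self.honor_hints = honor_hints                  # False gives plain LRU

    def access(self, address, nt=False):
        """Return True on a hit, False on a miss."""
        line = address // self.line_bytes
        lines = self.sets[line % self.num_sets]
        keep_rank = nt and self.honor_hints
        if line in lines:
            if not keep_rank:              # a regular hit promotes the line to MRU
                lines.remove(line)
                lines.insert(0, line)
            return True
        if len(lines) == self.ways:
            lines.pop()                    # evict the LRU line
        if keep_rank:
            lines.append(line)             # assumption: nt fill enters at the LRU position
        else:
            lines.insert(0, line)
        return False


def run(cache):
    cache.access(0)                        # regular access to a "hot" line
    for address in (16, 32, 48):           # streaming accesses carrying the nt hint
        cache.access(address, nt=True)
    return cache.access(0)                 # is the hot line still resident?

print(run(HintedLRUCache(32, ways=2, honor_hints=True)))
print(run(HintedLRUCache(32, ways=2, honor_hints=False)))
```

In this single-set configuration the streaming accesses only ever replace each other when hints are honored, whereas plain LRU lets them push the hot line out.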

We chose two representative loop kernels: mxm, in which most data accesses
are column-major, and vpenta, in which most data accesses are row-major. Both
of them are from the SPEC92 093.nasa7 benchmark, written in Fortran. In addition,
to evaluate the effectiveness of our approach on large benchmarks, we also included
one complete benchmark, tomcatv from SPEC95 with the train data set, in our
workload. We evaluated our approach on caches of varying sizes (ranging
from 4K bytes to 32K bytes) and varying associativity (2-way and 4-way). Note
that for a direct-mapped cache, the replacement algorithm and cache hint do not
play any role. In our experimental work, we used a fixed cache line size of 16
bytes.

5.2 Performance Results 

The cache miss rates of the conventional cache and that of nt hint assisted cache 
are compared in Table 1. The cache miss results of these two types of cache 
are obtained by running exactly the same program generated by our compiler 
on the SimpleScalar simulator (simulating, respectively, the LRU replacement 
algorithm and the modified LRU replacement algorithm). 
Table 1. Effectiveness of our approach in reducing cache misses. Column “LRU”
reports cache miss rates of conventional caches with LRU replacement. Column
“LRU+hint” reports cache miss rates of nt hint assisted caches, which use a modified
LRU replacement. Column “Red.” gives the percentage reduction in cache miss rates
due to our approach



Cache size   4K                      8K                      16K                     32K
Benchmark    LRU    LRU+hint  Red.   LRU    LRU+hint  Red.   LRU    LRU+hint  Red.   LRU    LRU+hint  Red.

Result on 4-way associative caches
mxm          35%    29.4%     16%    35%    15%       57.1%  8%     8%        0%     8%     8%        0%
vpenta       21.7%  21.3%     1.8%   21.6%  19.7%     8.8%   17.2%  13.5%     21.5%  13.2%  10.9%     17.4%
tomcatv      21%    21.5%     -2.4%  20.9%  18.5%     11.5%  20.1%  17.1%     14.9%  16%    14%       12.5%
average                       5.1%                    25.8%                   12.1%                   10%

Result on 2-way associative caches
mxm          35%    28.7%     18%    21.9%  15.2%     30.6%  8.1%   8%        1.2%   4%     4%        0%
vpenta       21.9%  22.18%    -1.3%  21.7%  18.9%     12.9%  19.5%  16.3%     16.4%  18.6%  16.2%     12.9%
tomcatv      22.9%  23%       -0.4%  22.5%  22.6%     -0.4%  20%    18.1%     9.5%   16.9%  14.8%     12.4%
average                       5.4%                    14.3%                   9%                      8.4%



Our approach shows the largest benefit on mxm for the 8K byte 4-way
cache. It reduces the cache miss rate from 35% to 15% (a 57.1% reduction in
the number of cache misses). As elaborated in Section 3, the key to achieving
satisfactory overall cache locality is to ensure that the reuse of the references
to A is realized, since in this example the reference to C has distant reuse and
the references to B are loop-invariant. But, in a conventional cache, cache
pollution caused by array C prevents array A from enjoying its temporal locality,
leading to poor locality on caches of size 4K and 8K bytes. For the 8K byte cache,
41.3% of the executed memory instructions are given the nt cache hint by our
approach. This ensures that data accessed by normal memory instructions
(references to A in this case) stay in the cache for a relatively longer time,
which in turn results in better temporal locality.

For the 2-way 8K byte cache, our approach is also quite effective, reducing the
number of cache misses in mxm by 30.6%. The percentage reduction achieved
on a 2-way cache is lower than that achieved on a 4-way cache. Although this
is counter-intuitive, we observe that, even for the conventional cache with the
original LRU replacement, mxm achieves lower cache miss rates on a 2-way 8K
byte cache than on a 4-way cache of the same size. This could be due to higher
conflict misses, as 4-way associativity results in fewer sets (128 sets) than 2-way
associativity (256 sets) on an 8K byte cache.
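
The arithmetic behind these figures can be checked with a short sketch. The helper names are hypothetical; the 16-byte line size and the miss rates are the ones reported in the text, and the reduction is assumed to be computed relative to the LRU miss rate, which matches the reported 57.1% and 30.6%.

```python
def num_sets(cache_bytes, ways, line_bytes=16):
    """Number of sets in a set-associative cache."""
    return cache_bytes // (ways * line_bytes)

def miss_reduction(lru_rate, hinted_rate):
    """Percentage reduction in miss rate relative to the LRU baseline."""
    return 100.0 * (lru_rate - hinted_rate) / lru_rate

print(num_sets(8 * 1024, ways=4))             # sets in the 8K byte 4-way cache
print(num_sets(8 * 1024, ways=2))             # sets in the 8K byte 2-way cache
print(round(miss_reduction(35.0, 15.0), 1))   # mxm, 8K byte 4-way
print(round(miss_reduction(21.9, 15.2), 1))   # mxm, 8K byte 2-way
```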

Our approach performs consistently better than the conventional cache for larger
cache sizes (16K and 32K bytes). For caches of relatively smaller sizes
(especially 4K bytes), our approach performs marginally better than the
conventional cache, but not consistently. The reason for this is that when data
accessed by an instruction with the nt hint is brought in, its lifetime in the
cache is typically much shorter than that in a conventional cache. Although this
is beneficial for other data with temporal locality, the short cache lifetime of
the accessed data jeopardizes spatial locality, since it may be replaced before
the adjacent data items are accessed. On a cache of small size, this happens more
frequently.

Fig. 5. Impact of our approach on locality of regular and nt-hint objects:
(a) average number of references for nt-hint objects; (b) average number of
references for regular objects

To verify the above conjecture, we designed an experiment in which each 
cache object is classified as a regular object or an nt-hint object depending 
on whether the data object accessed is brought into the cache using a regular 
memory instruction or with an nt hint memory instruction. We measured the 
number of references for each cache block during its life-time (from the time 
the cache block is brought in to the time it is replaced). Using this we compute 
the average number of references for regular objects and nt-hint objects. We 
compute these values for both classes of objects with the original as well as the 
modified LRU replacement algorithm. Note, in all our experiments the code run 
in the simulator is the same (one which includes nt-hint memory instruction). 
Only the replacement policy used (original LRU or modified LRU) is different 
for the different caches. 
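
The measurement can be sketched as follows. This is an illustration only, not the instrumented simulator: it assumes a fully associative cache, a trace of (cache line, nt flag) pairs, and that a line filled by an nt miss enters at the LRU position; the function name and the tiny trace are hypothetical.

```python
from collections import OrderedDict

def average_refs_per_lifetime(trace, num_lines, honor_hints):
    """Average number of references per cache-block lifetime, per object class.

    Returns {False: average for regular objects, True: average for nt-hint objects}.
    """
    cache = OrderedDict()   # line -> [reference count, loaded by nt instruction]; LRU first
    totals = {False: [0, 0], True: [0, 0]}   # object class -> [references, lifetimes]

    def retire(entry):
        refs, is_nt_object = entry
        totals[is_nt_object][0] += refs
        totals[is_nt_object][1] += 1

    for line, is_nt in trace:
        keep_rank = honor_hints and is_nt    # modified LRU ignores nt accesses in ranking
        if line in cache:
            cache[line][0] += 1
            if not keep_rank:
                cache.move_to_end(line)      # promote to most recently used
        else:
            if len(cache) >= num_lines:
                retire(cache.popitem(last=False)[1])   # lifetime of the victim ends
            cache[line] = [1, is_nt]
            if keep_rank:
                cache.move_to_end(line, last=False)    # assumption: fill at LRU position
    for entry in cache.values():             # close the lifetimes still open at the end
        retire(entry)
    return {cls: (refs / n if n else 0.0) for cls, (refs, n) in totals.items()}

# The same trace (same code) is replayed under both replacement policies.
trace = [(0, False), (10, True), (11, True), (0, False),
         (12, True), (13, True), (0, False)]
lru = average_refs_per_lifetime(trace, num_lines=2, honor_hints=False)
hinted = average_refs_per_lifetime(trace, num_lines=2, honor_hints=True)
print("regular objects:", lru[False], "->", hinted[False])
print("nt-hint objects:", lru[True], "->", hinted[True])
```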

Figure 5(a) shows the average number of references for nt-hint objects for
the tomcatv benchmark. It can be seen that the average number of references remains
the same between the original and the modified LRU replacement for the 32K byte
cache. However, for small cache sizes, there is a decrease in the average number of
references. This shows that the spatial locality exploited in nt-hint objects is
lower in nt-hint assisted caches, especially when the cache size is smaller. In
other words, the locality of the nt-hint objects is indeed sacrificed. For
reference purposes, we also show the average number of references for regular
objects in Figure 5(b). It can be seen that the modified LRU algorithm (with the
nt hint) improves the locality of regular objects for all cache sizes. These two
graphs (refer to Figure 5) tell us that the key to reducing the cache miss ratio
on the studied architectures is to avoid or minimize the degradation of the
locality exploited in nt-hint objects while enhancing the locality exploited in
regular objects. Fortunately, for most cases the benefits achieved in temporal
locality exploited in regular objects by our approach dominate the possible loss
of spatial locality exploited in nt-hint objects. This is evidenced by the
positive average reduction in cache misses we achieved for all cache sizes we
considered.
Table 2. Effectiveness of our approach in improving program performance. This table
shows the normalized execution time of benchmark programs running on a conventional
cache with LRU replacement (shown in column “LRU”) and on a cache with modified
LRU replacement (shown in column “LRU+hint”). For each program, execution time
shown is normalized using the execution time of the program on a conventional 4K
byte, 4-way associative D-cache



Cache size   4K                      8K                      16K                     32K
Benchmark    LRU    LRU+hint  Red.   LRU    LRU+hint  Red.   LRU    LRU+hint  Red.   LRU    LRU+hint  Red.

Result on 4-way associative caches
mxm          1      0.76      24%    1      0.73      27%    0.58   0.58      0%     0.58   0.57      1.7%
vpenta       1      1.04      -4%    1      1         0%     0.92   0.85      7.6%   0.90   0.78      13.3%
tomcatv      1      1.03      -3%    1      0.94      6%     0.98   0.91      7.1%   0.89   0.82      7.9%
average                       5.7%                    11%

Result on 2-way associative caches
mxm          1      0.74      26%    0.73   0.72      0.3%   0.58   0.58      0%     0.44   0.44      0%
vpenta       1      1.04      -4%    1      0.98      2%     0.98   0.93      5.1%   0.96   0.93      3.1%
tomcatv      1.02   1.07      -4.9%  1.01   1.07      -5.9%  0.98   0.92      6.1%   0.93   0.85      8.6%
average                       5.7%                    -1.2%                   3.7%                    3.9%



We observe that our approach is more effective on caches of higher
associativity. As shown, our approach reduces the cache miss rate to a larger
extent for 4-way associative caches than for 2-way associative caches. One possible
reason for this is that our problem formulation does not take cache conflicts into
account. In our problem formulation given in Section 4, we optimistically assumed
that the residency of reference windows is only constrained by the effective cache
size. This assumption gives us a simple problem formulation, but it ignores
conflict misses, which are non-negligible on caches of low associativity. Our
future work will consider using conflict-avoiding techniques like data
padding to improve the effectiveness of our approach.

Next we report the impact of reduced cache misses (due to the nt hint) on
program performance. For this, we obtain the program execution time, expressed
in execution cycles, from the SimpleScalar simulator. We simulate a superscalar
processor which issues 2 instructions in a clock cycle and employs out-of-order
instruction issue and out-of-order execution. We consider one level of cache: the
I-cache is 16K bytes, and the size of the D-cache varies between 4K and 32K bytes.
The cache hit latency is 1 cycle, and the cache miss penalty is 40 cycles.
Performance results for a conventional cache and a cache with nt hints are
reported in Table 2.

We observe that the reduction in cache misses (due to nt hints) does result
in a corresponding reduction in the execution time, although not by the same
amount. This is because cache miss rate is not the only factor affecting program
performance, especially in out-of-order issue processors. In general, we observe
that the cache miss rate reduction achieved by our approach is accompanied by a
corresponding performance improvement. With the widening speed gap between
processor and memory, our approach can have more performance impact on future
microprocessors.
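
One way to see why the two reductions differ is the standard average memory access time estimate, hit latency plus miss rate times miss penalty. The sketch below applies it to the mxm miss rates reported for the 8K byte 4-way cache; it is a back-of-the-envelope model, not part of the simulation, and it ignores non-memory instructions and the overlap of misses with other work in an out-of-order processor.

```python
HIT_LATENCY = 1      # cycles, from the simulated configuration
MISS_PENALTY = 40    # cycles, from the simulated configuration

def average_access_time(miss_rate):
    """Average cycles per memory access: hit latency + miss rate * miss penalty."""
    return HIT_LATENCY + miss_rate * MISS_PENALTY

before = average_access_time(0.35)   # mxm, 8K byte 4-way, LRU
after = average_access_time(0.15)    # mxm, 8K byte 4-way, LRU+hint
print(before, after, 1 - after / before)
```

The estimated access time drops by roughly half, whereas Table 2 reports a 27% reduction in execution time for this configuration, which is consistent with memory stalls being only one component of the total cycle count.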
6 Related Work

Improving cache performance has attracted a lot of attention from both the
architecture and compiler perspectives. Specifically, enhancing the instruction set
with cache hints was pioneered by Chi and Dietz. They studied an architectural
innovation that introduces cache-bypassing memory instructions [6, 7]. In their
architecture model, data accessed by cache-bypassing memory instructions is not
allocated a cache line. Their approach helps to avoid cache pollution, but
severely compromises spatial locality. By using cache hints, we can get better
temporal locality without sacrificing the spatial locality significantly.

Wang et al. studied a hypothetical architecture similar to the one considered
in this paper [19], and proposed a heuristic compiler algorithm for this
architecture. However, our work differs from theirs in two major aspects: (i) we
performed an in-depth study of the compiler algorithm while they focused on the
architectural implementation; (ii) we presented a systematic formulation while
they used an ad-hoc algorithm. Lastly, their algorithm does have the cyclic
dependency problem mentioned in Section 1. In future work, we plan to compare
our approach with their heuristic method.

Anantharaman and Pande studied the problem of optimizing loop execution 
on embedded systems with scratch-pad memory and without cache [2] . Interest- 
ingly, they formulated the problem as a 0/1 knapsack problem as well. However, 
the problem they studied is different from ours since scratch-pad memory differs 
from the cache in that it is free of hardware interference. 

7 Conclusions 

Improving cache performance is of significant importance in modern processors. 
In this paper we exploited compiler-assisted cache management which utilizes the 
cache more efficiently to achieve better performance. In particular, we studied the 
problem of determining the subset of references that should be given nt (stands 
for “non-temporal”) cache hints to minimize the cache miss rate. We observe the 
relationship between cache miss rate and cache-residency of reference windows in 
Section 3. This observation forms the basis for our formulation that in order for 
an array reference to realize its temporal reuse, its reference window must be fully 
accommodated in the cache. We then formulated the problem as a 0/1 knapsack 
problem for the following two cases: (i) only self-reuse exists, and (ii) group-reuse 
exists. To the best of our knowledge this is the first systematic formulation of 
this problem. We evaluated our approach by implementing it in a re-engineered 
MIPSpro compiler generating SimpleScalar instructions and running it through the
SimpleScalar simulator. Our simulation results show that our approach exploited
the architecture potential well. It reduced the number of data cache misses by 
up to 57%, and program execution time by up to 25.7%. Our plan for future 
work includes performing a comprehensive evaluation on the sensitivity of our 
approach to cache associativity and cache line size, integrating our approach 
with other locality-enhancing techniques, and comparing it with related work. 
Abstract. The accurate modeling of the electronic structure of atoms
and molecules involves computationally intensive tensor contractions 
over large multi-dimensional arrays. Efficient computation of these con- 
tractions usually requires the generation of temporary intermediate ar- 
rays. These intermediates could be extremely large, requiring their stor- 
age on disk. However, the intermediates can often be generated and used 
in batches through appropriate loop fusion transformations. To optimize 
the performance of such computations a combination of loop fusion and 
loop tiling is required, so that the cost of disk I/O is minimized. In 
this paper, we address the memory-constrained data-locality optimiza- 
tion problem in the context of this class of computations. We develop 
an optimization framework to search among a space of fusion and tiling 
choices to minimize the data movement overhead. The effectiveness of 
the developed optimization approach is demonstrated on a computation 
representative of a component used in quantum chemistry suites. 



1 Introduction 

An increasing number of large-scale scientific and engineering applications are 
highly data intensive, operating on large data sets that range from gigabytes to 
terabytes, thus exceeding the physical memory of the machine. 

Scientific applications, in particular electronic structure codes widely em- 
ployed in quantum chemistry [12, 13], computational physics, and material sci- 
ence, require elaborate interactions between subsets of data; data cannot be 
simply brought into the physical memory once, processed, and then over-written 
by new data. Subsets of data are repeatedly moved back and forth between 
a small memory pool, limited physical memory, and a large memory pool, the 
unlimited disk. The cost introduced by these data movements has a large impact
on the overall execution time of the computation. In such contexts, it is neces- 
sary to develop out-of-core algorithms that explicitly orchestrate the movement 
of subsets of data within the memory-disk hierarchy. These algorithms must en- 
sure that data is processed in subsets small enough to fit the machine’s main 
memory, but large enough to minimize the cost of moving data between disk 
and memory. 

This paper presents an approach to the automated synthesis of out-of-core 
programs with particular emphasis on the Tensor Contraction Engine (TCE) 
program synthesis system [1, 3, 2, 5, 4]. The TCE targets a class of electronic 
structure calculations, which involve many computationally intensive compo- 
nents expressed as tensor contractions (essentially generalized matrix products 
involving higher dimensional matrices). Although the current implementation 
addresses tensor contraction expressions arising in quantum chemistry, the ap- 
proach developed here has broader applicability; we believe it can be extended 
to automatically generate efficient out-of-core code for a range of computations 
expressible as imperfectly nested loop structures operating on arrays potentially 
larger than the physical memory size. 

The evaluation of such expressions involves explicit decisions about: 

— the structure of loops, including tiling strategies 

— the evaluation order of intermediate arrays 

— memory operations (allocate, deallocate, reallocate) 

— disk operations (read, write) 

The fundamental compiler transforms that we apply here are loop fusion and 
loop tiling: 

— Loop Fusion: The evaluation of the tensor contraction expressions often 
results in the generation of large temporary arrays that would be too large 
to be produced entirely in main memory by a “producer” loop nest and then 
consumed by a “consumer” loop nest. By suitably fusing common loops in the 
producer and consumer loop nests, it is feasible to reduce the dimensionality 
of the buffer array used in memory and store the intermediate array on disk. 
Thus, a smaller in-memory array may be used to produce the full disk array 
in chunks. 

— Loop Tiling: It enables data locality to be enhanced, so that the cost of 
moving data to/from disk is decreased. 
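
The effect of fusion on the size of the intermediate can be illustrated with a small in-memory sketch. The computation (a matrix product feeding a weighted row sum), the function names, and the data are hypothetical, and disk I/O is omitted; the point is only that fusing the common loop i shrinks the buffer from two dimensions to one.

```python
def unfused(A, B, C):
    """Producer builds the full 2-D intermediate T; consumer then reads it."""
    n, m, p = len(A), len(B), len(B[0])
    # Producer loop nest: T[i][j] = sum_k A[i][k] * B[k][j]
    T = [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
         for i in range(n)]
    # Consumer loop nest: R[i] = sum_j T[i][j] * C[j]
    return [sum(T[i][j] * C[j] for j in range(p)) for i in range(n)]

def fused(A, B, C):
    """Loop i is fused, so only one row of T is live at any time."""
    n, m, p = len(A), len(B), len(B[0])
    R = []
    for i in range(n):
        # 1-D buffer replacing the 2-D intermediate
        t_row = [sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
        R.append(sum(t_row[j] * C[j] for j in range(p)))
    return R

A = [[1, 2], [3, 4]]
B = [[1, 0, 2], [0, 1, 3]]
C = [1, 1, 1]
print(unfused(A, B, C))
print(fused(A, B, C))
```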

For minimizing the disk access cost under a given memory constraint, the 
compiler needs to search among many possible loop fusion structures, tile sizes, 
and placements of temporaries on disk. Conceptually, it is necessary to search 
along all three dimensions simultaneously. A decoupled approach that first 
searches for a fusion structure that minimizes the memory usage, followed by 
tiling and disk placements [5], may produce code with a sub-optimal disk-access
cost as an example in the next section illustrates. A decoupled approach that
first optimizes disk access by tiling the loops and placing temporaries on disk,
followed by loop fusion for reducing memory usage, may fail to find a solution
that fits into memory since the constraints imposed by tiling prohibit many 
possible fusions. However, a simultaneous search along all three dimensions is 
computationally infeasible. 

In this paper, we present an integrated approach in which we first search 
for possible fusion structures together with disk placements. The result of this 
search is a set of candidate loop structures with different memory requirements 
and different combinations of disk placements for the temporaries. For each of 
the solutions from this search we then search for the tile sizes that minimize the 
disk access cost [6]. We present two algorithms for the combined fusion and disk 
placement search: an optimal algorithm that is guaranteed to find the solution 
that will have the minimum disk access cost after tiling and a heuristic algorithm 
that is more efficient but may result in a suboptimal solution after tiling. 

The rest of the paper is organized as follows. In the next section, we discuss 
the class of computations that we consider and discuss an example from compu- 
tational chemistry. In Sec. 3, we introduce the main concepts used in this paper. 
Sec. 4 presents an optimal fusion plus tiling algorithm. Sec. 5 presents a sub- 
optimal, but empirically efficient fusion plus tiling algorithm. Sec. 6 presents 
experimental evidence of the performance of this algorithm, and conclusions are 
provided in Sec. 7. 

2 The Computational Context 

We consider the class of computations in which the final result to be computed 
can be expressed in terms of tensor contractions, essentially a collection of multi- 
dimensional summations of the product of several input arrays. There are many 
different ways to compute the final result due to commutativity, associativity and 
distributivity. The ways to compute the final result could differ widely in the 
number of floating point operations required, in the amount of memory needed, 
and in the amount of disk-to-memory traffic. 

As an example, consider a transformation often used in quantum chemistry 
codes to transform a set of two-electron integrals from an atomic orbital (AO) 
basis to a molecular orbital (MO) basis: 

$$B(a,b,c,d) = \sum_{p,q,r,s} C1(d,s) \times C2(c,r) \times C3(b,q) \times C4(a,p) \times A(p,q,r,s)$$

Here, A(p,q,r,s) is an input four-dimensional array (assumed to be initially
stored on disk), and B(a,b,c,d) is the output transformed array, which needs to
be placed on disk at the end of the calculation. The arrays C1 through C4 are
called transformation matrices.

The indices p, q, r, and s run over all orbitals and have the
same range N equal to O + V, where O is the number of occupied orbitals in
the chemistry problem and V is the number of unoccupied (virtual) orbitals.
Similarly, the indices a, b, c, and d have the same range equal to V. Typical
values for O range from 10 to 300, and the number of virtual orbitals V is
usually between 50 and 1000.
The result array B is computed in four steps to reduce the number of 
floating point operations from $O(V^4 N^4)$ in the initial formula (8 nested loops, 
for p, q, r, s, a, b, c, and d) to $O(V N^4)$ as shown below: 

$$B(a,b,c,d) = \sum_{s} C1(d,s) \times \Bigl( \sum_{r} C2(c,r) \times \Bigl( \sum_{q} C3(b,q) \times \Bigl( \sum_{p} C4(a,p) \times A(p,q,r,s) \Bigr) \Bigr) \Bigr)$$

The result of this operation-minimal approach is the creation of three temporary 
intermediate arrays T1, T2, and T3 as follows: 
$T1(a,q,r,s) = \sum_{p} C4(a,p) \times A(p,q,r,s)$, 
$T2(a,b,r,s) = \sum_{q} C3(b,q) \times T1(a,q,r,s)$, and 
$T3(a,b,c,s) = \sum_{r} C2(c,r) \times T2(a,b,r,s)$. 
Assuming that the available memory limit on the machine running 
this calculation is less than $V^4$ (which is 3TB for V = 800), any of the logical 
arrays A, T1, T2, T3, and B is too large to entirely fit in memory. Therefore, if 
the computation is implemented as a succession of four independent steps, the 
intermediates T1, T2, and T3 have to be written to disk once they are produced, 
and read from disk before they are used in the next step. Furthermore, the 
disk access volume could be much larger than the total volume of the 
data on disk containing A, T1, T2, T3, and B. Since none of these arrays can 
be fully stored in memory, it may not be possible to perform all multiplication 
operations by reading each element of the input arrays from disk only once. 
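
The equivalence of the direct formula and the four-step evaluation can be checked on a toy problem. The following is only a minimal in-memory sketch in Python, not the out-of-core code discussed in this paper: the function names direct, four_step and make_inputs are hypothetical, arrays are stored as dictionaries, and integer test data is used so that the comparison is exact.

```python
from itertools import product
import random


def direct(A, C1, C2, C3, C4, N, V):
    """Initial formula: eight nested loops, O(V^4 N^4) multiply-adds."""
    B = {}
    for a, b, c, d in product(range(V), repeat=4):
        B[a, b, c, d] = sum(
            C1[d][s] * C2[c][r] * C3[b][q] * C4[a][p] * A[p, q, r, s]
            for p, q, r, s in product(range(N), repeat=4))
    return B


def four_step(A, C1, C2, C3, C4, N, V):
    """One contraction per step through T1, T2, T3: O(V N^4) multiply-adds."""
    n, v = range(N), range(V)
    T1 = {(a, q, r, s): sum(C4[a][p] * A[p, q, r, s] for p in n)
          for a, q, r, s in product(v, n, n, n)}
    T2 = {(a, b, r, s): sum(C3[b][q] * T1[a, q, r, s] for q in n)
          for a, b, r, s in product(v, v, n, n)}
    T3 = {(a, b, c, s): sum(C2[c][r] * T2[a, b, r, s] for r in n)
          for a, b, c, s in product(v, v, v, n)}
    return {(a, b, c, d): sum(C1[d][s] * T3[a, b, c, s] for s in n)
            for a, b, c, d in product(v, v, v, v)}


def make_inputs(N, V, seed=0):
    """Small random integer inputs: A is N^4, each C matrix is V x N."""
    rng = random.Random(seed)
    A = {idx: rng.randint(-3, 3) for idx in product(range(N), repeat=4)}
    Cs = [[[rng.randint(-3, 3) for _ in range(N)] for _ in range(V)]
          for _ in range(4)]
    return A, Cs


if __name__ == "__main__":
    A, (C1, C2, C3, C4) = make_inputs(N=3, V=2)
    same = direct(A, C1, C2, C3, C4, 3, 2) == four_step(A, C1, C2, C3, C4, 3, 2)
    print("four-step result equals direct result:", same)
```

In this sketch every intermediate is held in full, which is exactly what becomes impossible at realistic sizes and what motivates the fusion and tiling discussed next.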

We use loop fusion to reduce the memory requirements for the temporary 
arrays and loop fusion together with loop tiling to reduce the disk access volume. 
To illustrate the interactions between fusion and tiling, consider the following 
simple example with only two contractions: 

$$D_{ij} = \sum_{k} A_{ik} \times \Bigl( \sum_{l} B_{kl} \times C_{jl} \Bigr)$$

To prevent the intermediate array $t[k,j] = \sum_{l} B_{kl} \times C_{jl}$ from having to be 
written to disk in case it does not fit in memory, we need to fuse loops between 
the producer and the consumer of t[k,j]. This results in the intermediate array 
being formed and used in a pipelined fashion. For every loop that is fused between 
the producer and the consumer of an intermediate, the corresponding dimension 
can be removed from the intermediate. E.g., in the loop structure in Fig. 1(a), 
the intermediate t[k,j] could be reduced to a scalar, while in the loop structure 
in Fig. 2(a), it could only be reduced to a vector t[k]. 

Notice that to reduce the memory requirements of the temporary to a 
scalar in Fig. 1(a), it is necessary to have the file read operations for B and C 
inside the innermost loop. As a result, the input arrays are read redundantly 
multiple times. In this example, B is read once for every iteration of the j loop, 
while C is read once for every iteration of the k loop. 
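
The redundancy can be made concrete by counting element reads in a loop nest shaped like Fig. 1(a). This is a hedged sketch only: the function name fused_scalar is hypothetical, the arrays are ordinary in-memory lists standing in for disk files, and D is kept in memory for simplicity.

```python
from collections import Counter


def fused_scalar(A, B, C, Ni, Nj, Nk, Nl):
    """Loop nest in the style of Fig. 1(a): the temporary t is a scalar.

    `reads` counts element reads per input array, as a stand-in for the
    disk read operations that sit inside the loops.
    """
    reads = Counter()
    D = [[0] * Nj for _ in range(Ni)]
    for j in range(Nj):
        for k in range(Nk):
            t = 0
            for l in range(Nl):
                reads["B"] += 1          # B[k,l] is fetched again for every j
                reads["C"] += 1          # C[j,l] is fetched again for every k
                t += C[j][l] * B[k][l]
            for i in range(Ni):
                reads["A"] += 1
                D[i][j] += A[i][k] * t
    return D, reads


if __name__ == "__main__":
    Ni, Nj, Nk, Nl = 2, 3, 4, 5
    A = [[i + k for k in range(Nk)] for i in range(Ni)]
    B = [[k * l + 1 for l in range(Nl)] for k in range(Nk)]
    C = [[j - l for l in range(Nl)] for j in range(Nj)]
    D, reads = fused_scalar(A, B, C, Ni, Nj, Nk, Nl)
    print("reads of B per element:", reads["B"] // (Nk * Nl))   # equals Nj
    print("reads of C per element:", reads["C"] // (Nj * Nl))   # equals Nk
```

Each element of B is touched Nj times and each element of C is touched Nk times, which matches the statement above.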

The number of redundant read operations can be reduced by tiling the loops 
and reading entire tiles in one operation as illustrated in Fig. 1(b). B, e.g., is now 
only read redundantly once for every iteration of the jT tiling loop. In exchange, 
  FOR i, j 
    D[i,j] = 0 
  FOR j, k 
    t = 0 
    FOR l 
      t += C[j,l] * B[k,l] 
    FOR i 
      D[i,j] += A[i,k] * t 

(a) Memory minimal loop structure 

  FOR iT, jT 
    Initialize(D[iI,jI], 0.0) 
    Write D[iI,jI] to D[iT + iI, jT + jI] 
  FOR jT, kT 
    Initialize(t[jI,kI], 0.0) 
    FOR lT 
      C[jI,lI] = Read C[jT + jI, lT + lI] 
      B[kI,lI] = Read B[kT + kI, lT + lI] 
      FOR jI, kI, lI 
        t[jI,kI] += C[jI,lI] * B[kI,lI] 
    FOR iT 
      D[iI,jI] = Read D[iT + iI, jT + jI] 
      A[iI,kI] = Read A[iT + iI, kT + kI] 
      FOR jI, kI, iI 
        D[iI,jI] += A[iI,kI] * t[jI,kI] 
      Write D[iI,jI] to D[iT + iI, jT + jI] 

(b) Tiled loop structure 

Fig. 1. Illustration of the decoupled approach for a simple example 



  FOR j 
    FOR k 
      t[k] = 0.0 
    FOR l 
      C = Read C[j,l] 
      FOR k 
        B = Read B[k,l] 
        t[k] += B * C 
    FOR i 
      D = 0.0 
      FOR k 
        A = Read A[i,k] 
        D += A * t[k] 
      Write D to D[i,j] 

(a) Best loop structure with temporary in memory 

  FOR jT 
    FOR kT, jI, kI 
      t[kT + kI, jI] = 0.0 
    FOR lT 
      C[jI,lI] = Read C[j,l] 
      FOR kT 
        B[kI,lI] = Read B[k,l] 
        FOR jI, lI, kI 
          t[kT + kI, jI] += B[kI,lI] * C[jI,lI] 
    FOR iT 
      FOR jI, iI 
        D[iI,jI] = 0.0 
      FOR kT 
        A[iI,kI] = Read A[i,k] 
        FOR jI, iI, kI 
          D[iI,jI] += A[iI,kI] * t[kT + kI, jI] 
      Write D[iI,jI] to D[i,j] 

(b) Tiled loop structure 

Fig. 2. Illustration of the integrated approach for a simple example 



the memory requirement increases since all fused array dimensions get expanded 
to tile size. The disk access volume for a given loop structure can, therefore, be 
minimized by increasing the tile sizes until the memory is exhausted. 

In our previous decoupled approach to fusion and tiling, we first fused the 
loops in order to minimize the memory usage. The memory-minimal loop 
structure was then tiled to minimize the disk access cost, as shown in Fig. 1. We found 
that for some examples, this resulted in suboptimal solutions, since there were 
too many redundant read operations for the input arrays. Also, the 
memory-minimal loop structure often results in the summation loop being the 
outermost loop for a contraction. This requires the initialization of the result array 
to be outside the non-summation tiling loops, which then requires both a read 
and a write operation for the result array. This is illustrated with array D in 
Fig. 1(b). 
Minimizing the disk access cost before fusion by deciding which temporaries 
to put on disk is not possible, since the resulting constraints on the loop struc- 
ture might prevent the solution from fitting in memory. Also, since fusion can 
eliminate the need of writing some temporaries to disk, it can help reduce the 
disk access cost. What is, therefore, needed is an integrated approach in which 
we minimize the disk access cost under a memory constraint. The loop structure 
in Fig. 2 is the result of such an integrated approach. 

It is not feasible to search simultaneously over all possible loop structures and 
all possible tile sizes. Instead, we first produce a set of candidate loop structures 
and decide which of the temporaries are written to disk for a given loop structure. 
For each candidate solution in this set, we then determine the tile sizes that 
minimize the disk access cost. Finally, we select the tiled loop structure with the 
minimal disk access cost. We have previously described the tile size search and 
the proper placement of I/O operations in the tiled loop structure [6]. In this 
paper, we concentrate on the algorithms for finding the candidate solutions for 
the tile size search. 

3 Preliminaries 

Before describing the algorithms, we first need to present the notions of expres- 
sion trees, fusions, and nestings. Since these concepts, as well as the algorithms, 
are not limited to tensor contraction expressions, we describe them in the context 
of arbitrary sums-of-products expressions. For a more detailed explanation, 
readers are referred to [7, 8, 9, 10, 11]. As an example to illustrate the concepts, we 
use the multi-dimensional summation shown in Figure 3(a) represented by the 
expression tree in Figure 3(b). One way to fuse the loops is shown in Figure 3(c). 

Indexset Sequence. To describe the relative scopes of a set of fused loops, we 
introduce the notion of an indexset sequence, which is defined as an ordered list 
of disjoint, non-empty sets of loop indices. For example, $f = \langle \{i,k\}, \{j\} \rangle$ is an 
indexset sequence. For simplicity, we write each indexset in an indexset sequence 
as a string. Thus, f is written as $\langle ik, j \rangle$. Let g and g' be indexset sequences. We 
denote by Set(g) the union of all indexsets in g, i.e., $Set(g) = \bigcup_{1 \le r \le |g|} g[r]$. For 
instance, $Set(f) = Set(\langle ik, j \rangle) = \{i,j,k\}$. 
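
As an illustration, an indexset sequence maps naturally onto a list of sets. The sketch below is one possible representation, assumed here for illustration only; the names is_indexset_sequence and set_of are hypothetical and do not come from the paper.

```python
def is_indexset_sequence(g):
    """True if g is an ordered list of disjoint, non-empty sets of loop indices."""
    seen = set()
    for indexset in g:
        if not indexset or seen & indexset:   # empty set, or overlaps an earlier one
            return False
        seen |= indexset
    return True


def set_of(g):
    """Set(g): the union of all indexsets in g."""
    return set().union(*g)


if __name__ == "__main__":
    f = [{"i", "k"}, {"j"}]                   # the sequence <ik, j> from the text
    print(is_indexset_sequence(f), sorted(set_of(f)))
```

The list order carries the ranking of the indexsets, while each set groups loops of equal rank.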

Fusion. We use the notion of an indexset sequence to define a fusion. Intuitively, 
the loops fused between a node and its parent are ranked by their fusion scopes 
in the subtree from largest to smallest; two loops with the same fusion scope 
have the same rank (i.e., are in the same indexset). In the example, the fusion 
between B and f2 is $\langle k, jl \rangle$. 

Nesting. Similarly, a nesting of the loops at a node v can be defined as an 
indexset sequence. Intuitively, the loops at a node are ranked by their scopes in 
the subtree; two loops have the same rank (i.e., are in the same indexset) if they 
have the same scope. In the example, the loop nesting at f2 is $\langle k, jl \rangle$ (because 
the fused k-loop covers one more node, namely C). 
$$W[k] = \sum_{i} \sum_{j} \sum_{l} \bigl( A[i,j] \times B[j,k,l] \times C[k,l] \bigr)$$

(a) A multi-dimensional summation 

  f5 (sum over j) 
    f4 (x) 
      f1 (sum over i) 
        A[i,j] 
      f3 (sum over l) 
        f2 (x) 
          B[j,k,l] 
          C[k,l] 

(b) An expression tree for computing (a) 

  Initialize f1[j] 
  for i 
    for j 
      A = Read A[i,j] 
      f1[j] += A 
  Initialize f5[k] 
  for k 
    for l 
      C[l] = Read C[k,l] 
    for j 
      Initialize f3 
      for l 
        B = Read B[j,k,l] 
        f2 = B x C[l] 
        f3 += f2 
      f4 = f1[j] x f3 
      f5[k] += f4 

(c) A loop fusion configuration for (b) 



Fig. 3. An example multi-dimensional summation 



The “More-Constraining” Relation on Nestings. A nesting h at a node v 
is said to be more or equally constraining than another nesting h' at the same 
node, denoted $h \sqsubseteq h'$, if any loop fusion configuration for the rest of the 
expression tree that works with h also works with h'. This relation allows us to 
do effective pruning among the large number of loop fusion configurations for 
a subtree. 

4 Optimal Fusion + Tiling Algorithm 

We derive the memory usage and the disk access volume of arrays in tiled, 
imperfectly nested loops as follows. Without tiling, the memory usage of an 
array is the product of the ranges of its unfused dimensions. With tiling, the 
tile sizes of the fused dimensions also contribute to the product. The disk access 
volume is the size of the array times the trip counts of the loops surrounding 
the read/write statement but not corresponding to the dimensions of the array. 
Without tiling, the trip counts of such extra loops are simply their index ranges. 
With tiling, the trip counts become their index ranges divided by their tile sizes. 
In addition, if partial sums are produced and written to disk, they need to be 
read back into memory, thus doubling the disk access volume. 

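$$\mathit{MemUsage}(A, f) = \prod_{i \in \mathit{FusedDimens}(A,f)} T_i \;\times \prod_{i \in \mathit{UnfusedDimens}(A,f)} N_i$$

$$\mathit{DiskCost}(A, f) = \mathit{WriteFactor}(A, f) \times \prod_{i \in A.\mathit{dimens}} N_i \;\times \prod_{i \in \mathit{ExtraLoops}(A,f)} \frac{N_i}{T_i}$$

where 

$$\mathit{WriteFactor}(A, f) = \begin{cases} 2 & \text{if } f \text{ is the fusion between produce } A \text{ and write } A \text{ and } A.\mathit{dimens} \not\supseteq \mathit{Set}(f) \\ 1 & \text{otherwise} \end{cases}$$

$$\mathit{FusedDimens}(A, f) = A.\mathit{dimens} \cap \mathit{Set}(f)$$

$$\mathit{UnfusedDimens}(A, f) = A.\mathit{dimens} - \mathit{Set}(f)$$

$$\mathit{ExtraLoops}(A, f) = \mathit{Set}(f) - A.\mathit{dimens}$$

and f is the fusion between read A and consume A, between produce A and 
consume A, or between produce A and write A. 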

As an example, for a disk-resident array X[i,j,k], if the fusion between 
produce X and write X is $g = \langle ij \rangle$, then we have from the above equations: 

$$\mathit{FusedDimens}(X, g) = \{i, j\}$$

$$\mathit{UnfusedDimens}(X, g) = \{k\}$$

$$\mathit{MemUsage}(X, g) = T_i \times T_j \times N_k$$

$$\mathit{WriteFactor}(X, g) = 1$$

$$\mathit{ExtraLoops}(X, g) = \emptyset$$

$$\mathit{DiskCost}(X, g) = N_i \times N_j \times N_k$$



Note that if an intermediate array is written to disk, it would have two 
potentially-different MemUsage: one for before writing to disk and one after 
reading back from disk. Similarly, it would have two DiskCost: one for writing 
it and one for reading it. 
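
The definitions above translate almost directly into set operations. The following Python sketch is an illustration under stated assumptions rather than the implementation used in this work: a fusion is passed as the plain set Set(f), tile sizes are assumed to divide the index ranges evenly, and the test for WriteFactor = 2 is written as "the produce-to-write fusion has at least one extra loop", which is this sketch's reading of the condition. All function names are hypothetical.

```python
from math import prod


def fused_dimens(dimens, fusion):
    return dimens & fusion            # A.dimens intersected with Set(f)


def unfused_dimens(dimens, fusion):
    return dimens - fusion            # A.dimens minus Set(f)


def extra_loops(dimens, fusion):
    return fusion - dimens            # Set(f) minus A.dimens


def write_factor(dimens, fusion, produce_to_write):
    # Assumed reading: partial sums are written (and read back) only when a
    # fused loop around the write is not a dimension of the array.
    return 2 if produce_to_write and extra_loops(dimens, fusion) else 1


def mem_usage(dimens, fusion, N, T):
    return (prod(T[i] for i in fused_dimens(dimens, fusion))
            * prod(N[i] for i in unfused_dimens(dimens, fusion)))


def disk_cost(dimens, fusion, N, T, produce_to_write=False):
    # N[i] // T[i] is the trip count of a tiling loop (T[i] assumed to divide N[i]).
    return (write_factor(dimens, fusion, produce_to_write)
            * prod(N[i] for i in dimens)
            * prod(N[i] // T[i] for i in extra_loops(dimens, fusion)))


if __name__ == "__main__":
    N = {"i": 8, "j": 6, "k": 4}      # hypothetical index ranges
    T = {"i": 2, "j": 3, "k": 4}      # hypothetical tile sizes
    X, g = {"i", "j", "k"}, {"i", "j"}
    print(mem_usage(X, g, N, T))                            # T_i * T_j * N_k
    print(disk_cost(X, g, N, T, produce_to_write=True))     # N_i * N_j * N_k
```

With the hypothetical ranges and tile sizes shown, the example array X with fusion g yields a memory usage of T_i x T_j x N_k and a disk cost of N_i x N_j x N_k, as in the worked example.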

Since MemUsage and DiskCost depend on tile sizes, it may appear we cannot 
compare MemUsage and DiskCost between different fusions without knowing the 
tile sizes. However, some comparison is still possible. Continuing with the above 
example, if the fusion between produce X and write X is $g' = \langle il \rangle$, then: 

$$\mathit{MemUsage}(X, g') = T_i \times N_j \times N_k$$

$$\mathit{DiskCost}(X, g') = N_i \times N_j \times N_k \times N_l / T_l$$

No matter what tile sizes are used for g', we can use the same tile sizes for g 
and assure that $\mathit{MemUsage}(X, g) \le \mathit{MemUsage}(X, g')$ and 
$\mathit{DiskCost}(X, g) \le \mathit{DiskCost}(X, g')$ because $T_j \le N_j$ and $N_l / T_l \ge 1$. 
Hence, fusion g' for array X is inferior to fusion g and can be pruned away. 

Generalizing from this example, we obtain the sufficient conditions for a 
fusion to result in less or equal MemUsage or DiskCost than another one. 

$$\mathit{LeqMemUsage}(A, f, f') \equiv \mathit{FusedDimens}(A, f) \supseteq \mathit{FusedDimens}(A, f')$$

$$\mathit{LeqDiskCost}(A, f, f') \equiv \mathit{ExtraLoops}(A, f) \subseteq \mathit{ExtraLoops}(A, f')$$
The first condition above implies $\mathit{UnfusedDimens}(A, f) \subseteq 
\mathit{UnfusedDimens}(A, f')$ and hence $\mathit{MemUsage}(A, f) \le \mathit{MemUsage}(A, f')$ for 
the same set of tile sizes because $T_i \le N_i$ for any index i. Similarly, the second 
condition above (for LeqDiskCost(A, f, f')) implies $\mathit{WriteFactor}(A, f) \le 
\mathit{WriteFactor}(A, f')$ and $\mathit{DiskCost}(A, f) \le \mathit{DiskCost}(A, f')$ for the same set of tile 
sizes because $N_i / T_i \ge 1$ for any index i. 

In our example, both LeqMemUsage(X, g, g') and LeqDiskCost(X, g, g') are 
true because $\mathit{FusedDimens}(X, g) = \{i, j\}$ is a superset of $\mathit{FusedDimens}(X, g') = 
\{i\}$ and $\mathit{ExtraLoops}(X, g) = \emptyset$ is a subset of $\mathit{ExtraLoops}(X, g') = \{l\}$. 
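
Because both conditions are plain set comparisons, they can be evaluated without choosing any tile sizes. The sketch below shows this for the running example; it passes each fusion as the set of its loop indices, and the names leq_mem_usage and leq_disk_cost are hypothetical.

```python
def leq_mem_usage(dimens, f, f_prime):
    """Sufficient test: FusedDimens(A, f) is a superset of FusedDimens(A, f')."""
    return (dimens & f) >= (dimens & f_prime)


def leq_disk_cost(dimens, f, f_prime):
    """Sufficient test: ExtraLoops(A, f) is a subset of ExtraLoops(A, f')."""
    return (f - dimens) <= (f_prime - dimens)


if __name__ == "__main__":
    X = {"i", "j", "k"}
    g, g_prime = {"i", "j"}, {"i", "l"}
    # g is at least as good as g' on both counts, so g' can be pruned.
    print(leq_mem_usage(X, g, g_prime), leq_disk_cost(X, g, g_prime))
    # Two fusions can also be incomparable: neither set contains the other.
    h = {"j", "k"}
    print(leq_mem_usage(X, h, g_prime), leq_mem_usage(X, g_prime, h))
```

The second pair of calls returns False in both directions, which is the kind of incomparability that limits pruning in the optimal algorithm.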

To apply LeqMemUsage and LeqDiskCost to compare different solutions cor- 
responding to different fusion configurations for a subtree, we need to consider 
the different combinations of whether each array is disk-resident or not. 



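$$\mathit{LeqMemUsage}(s, s') \equiv \forall \text{ array } A \text{ in the subtree rooted at } s.\mathit{root}:$$

$$\begin{cases}
\mathit{LeqMemUsage}(A, s.A.f_r, s'.A.f_r) \text{ and } \mathit{LeqMemUsage}(A, s.A.f_w, s'.A.f_w) & \text{if } s.A.\mathit{ondisk} \text{ and } s'.A.\mathit{ondisk} \\
\mathit{FusedDimens}(A, s.A.f_r) \supseteq \mathit{FusedDimens}(A, s'.A.f_c) \text{ and } \mathit{FusedDimens}(A, s.A.f_w) \supseteq \mathit{FusedDimens}(A, s'.A.f_c) & \text{if } s.A.\mathit{ondisk} \text{ and not } s'.A.\mathit{ondisk} \\
\mathit{LeqMemUsage}(A, s.A.f_c, s'.A.f_r) \text{ and } \mathit{LeqMemUsage}(A, s.A.f_c, s'.A.f_w) & \text{if not } s.A.\mathit{ondisk} \text{ and } s'.A.\mathit{ondisk} \\
\mathit{LeqMemUsage}(A, s.A.f_c, s'.A.f_c) & \text{if not } s.A.\mathit{ondisk} \text{ and not } s'.A.\mathit{ondisk}
\end{cases}$$

$$\mathit{LeqDiskCost}(s, s') \equiv \forall \text{ array } A \text{ in the subtree rooted at } s.\mathit{root}:$$

$$\begin{cases}
\mathit{LeqDiskCost}(A, s.A.f_r, s'.A.f_r) \text{ and } \mathit{LeqDiskCost}(A, s.A.f_w, s'.A.f_w) & \text{if } s.A.\mathit{ondisk} \text{ and } s'.A.\mathit{ondisk} \\
\text{not } s.A.\mathit{ondisk} & \text{otherwise}
\end{cases}$$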



where 

  s.A.ondisk means array A is disk-resident in solution s 
  s.A.f_r is the fusion between read A and consume A in solution s 
  s.A.f_w is the fusion between produce A and write A in solution s 
  s.A.f_c is the fusion between produce A and consume A in solution s 

For input or final-result arrays where fusions f_r or f_w do not apply, or for 
intermediate disk-resident arrays where fusion f_r is yet to be decided, such fusions 
are considered empty sets. 

Making use of the above results, we can compare and prune solutions as 
follows. A solution that has higher or equal memory usage and disk access cost 
and a more or equally constraining nesting than another solution is considered 
inferior and can be pruned away safely. Between solutions for the entire tree and 
between solutions for a subtree whose root array is disk-resident and its fusion f_r 
is undecided, pruning without the condition of a more or equally constraining 
nesting is also safe. 
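$$\begin{aligned}
\mathit{Inferior}(s', s) \equiv\; & \mathit{LeqMemUsage}(s, s') \text{ and } \mathit{LeqDiskCost}(s, s') \text{ and} \\
& \bigl( s'.\mathit{root}.\mathit{nesting} \sqsubseteq s.\mathit{root}.\mathit{nesting} \text{ or } s.\mathit{root} = \mathit{Root} \text{ or} \\
& \;\; ( s.\mathit{root}.\mathit{ondisk} \text{ and } s'.\mathit{root}.\mathit{ondisk} \text{ and } s.\mathit{root}.f_r = s'.\mathit{root}.f_r = \emptyset ) \bigr)
\end{aligned}$$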

A dynamic programming, bottom-up algorithm using the Inferior condition 
as a pruning rule works as follows. For each leaf node (corresponding to an input 
array) in the tree, one solution is formed for each possible fusion f_r (or f_c if it 
is not disk-resident) with its parent and then inferior solutions are pruned away. 
For each intermediate array A in the tree, all possible legal fusions f_w and f_c, 
for writing A to disk or not respectively, are considered in deriving new solutions 
from the children of A. Solutions that write A to disk are pruned against each 
other before all possible legal fusions f_r are enumerated to derive new solutions. 
Then all inferior solutions for the subtree rooted at A, whether writing A to 
disk or not, are pruned away. For the root of the tree, if it is to be written to disk, 
all possible legal fusions f_w are considered in deriving new solutions. Finally, all 
inferior solutions for the entire tree are pruned away. 

Although this approach is guaranteed to find an optimal solution, it could 
be expensive. The reason is that the condition LeqMemUsage(s, s') requires each and 
every array in the subtree in solution s to have lower or equal memory usage 
than the corresponding array in solution s', and similarly for LeqDiskCost(s, s') 
in terms of disk access cost. If either the memory usage or the disk access cost 
of any array in s is incomparable to the corresponding array in s', no solution 
derived from s for a larger subtree would be comparable to any solution derived 
from s'. Thus, in the worst case, the number of unpruned solutions for the entire 
tree could grow exponentially in the number of arrays. Due to its exponential 
complexity, we have yet to implement this approach. 



5 Efficient Fusion + Tiling Algorithm 

Since the optimal fusion and tiling algorithm is impractical to implement because of 
its large number of unpruned solutions, we have devised a sub-optimal, efficient 
algorithm to solve the fusion and tiling problem. The central idea of this algo- 
rithm is to first fix a tile size T common to all the tiled loops, and, based on this 
tile size, determine a set of candidate solutions by a bottom-up tree traversal. In 
the second part of the algorithm, the tile sizes are allowed to vary, and optimal 
tile sizes are determined for all candidate solutions. The candidate solution with 
the lowest disk cost is finally chosen as the best overall solution. 

Our current implementation of the first part of the algorithm uses T = 1. 
With WriteFactor(A, f), FusedDimens(A, f), UnfusedDimens(A, f), and 
ExtraLoops(A, f) defined according to Section 4, the memory usage and disk 
cost for an array A become: 

$$\mathit{MemUsage}(A, f) = \prod_{i \in \mathit{UnfusedDimens}(A,f)} N_i$$

$$\mathit{DiskCost}(A, f) = \mathit{WriteFactor}(A, f) \times \prod_{i \in A.\mathit{dimens} \cup \mathit{Set}(f)} N_i$$

where f is the fusion between read A and consume A, between produce A and 
consume A, or between produce A and write A. 

When an intermediate array is stored on disk, it has two MemUsage: one 
for before writing to disk and one after reading back from disk. In this case, we 
define MemUsage as the maximum of the two values. Similarly, the array has two 
DiskCost : one for writing it and one for reading it. We define the total disk cost 
of an intermediate array that is stored on disk as the sum of the disk costs for 
writing it and for reading it back. 

With these definitions, we calculate the memory usage and disk cost of a 
solution s corresponding to a given fusion configuration for a subtree: 

$$\mathit{MemUsage}(s) = \sum_{A \text{ in the subtree rooted at } s.\mathit{root}} \mathit{MemUsage}(A, f_s)$$

$$\mathit{DiskCost}(s) = \sum_{A \text{ in the subtree rooted at } s.\mathit{root}} \mathit{DiskCost}(A, f_s)$$

where f_s is the fusion between read A and consume A, between produce A and 
consume A, or between produce A and write A given the fusion configuration of 
the solution s. 

Different solutions corresponding to different fusion configurations for a sub- 
tree are now easily comparable: 

$$\mathit{LeqMemUsage}(s, s') \equiv \mathit{MemUsage}(s) \le \mathit{MemUsage}(s')$$

$$\mathit{LeqDiskCost}(s, s') \equiv \mathit{DiskCost}(s) \le \mathit{DiskCost}(s')$$



Making use of the above results, we can introduce pruning rules similar to 
those of the optimal algorithm: a solution that has higher or equal memory usage 
and disk access cost and a more or equally constraining nesting than another 
solution is considered inferior and can be pruned away safely. 

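$$\begin{aligned}
\mathit{Inferior}(s', s) \equiv\; & \mathit{LeqMemUsage}(s, s') \text{ and } \mathit{LeqDiskCost}(s, s') \text{ and} \\
& \bigl( s'.\mathit{root}.\mathit{nesting} \sqsubseteq s.\mathit{root}.\mathit{nesting} \text{ or } s.\mathit{root} = \mathit{Root} \text{ or} \\
& \;\; ( s.\mathit{root}.\mathit{ondisk} \text{ and } s'.\mathit{root}.\mathit{ondisk} \text{ and } s.\mathit{root}.f_r = s'.\mathit{root}.f_r = \emptyset ) \bigr)
\end{aligned}$$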

A dynamic programming, bottom-up algorithm using the Inferior condition 
as a pruning rule works in the same fashion as the optimal algorithm described in 
Section 4. The major difference between the optimal algorithm and the efficient 
  FOR r, s 
    FOR a, q 
      T1[a,q] = 0.0 
    FOR p 
      C4[a] = Read C4[p,a] 
      FOR q 
        A = Read A[p,q,r,s] 
        FOR a 
          T1[a,q] += A * C4[a] 
    FOR b 
      FOR a 
        T2[a] = 0.0 
      FOR q 
        C3 = Read C3[q,b] 
        FOR a 
          T2[a] += T1[a,q] * C3 
      Write T2 to T2[a,b,r,s] 

  FOR a, b, c 
    FOR s 
      T3[s] = 0.0 
    FOR r 
      C2 = Read C2[r,c] 
      FOR s 
        T2 = Read T2[a,b,r,s] 
        T3[s] += T2 * C2 
    FOR d 
      B = 0.0 
      FOR s 
        C1 = Read C1[s,d] 
        B += T3[s] * C1 
      Write B to B[a,b,c,d] 

Fig. 4. Fused structure with temporary T2 on disk 



algorithm is that the Inferior(s', s) condition is more relaxed in the latter: we 
no longer require that the MemUsage and DiskCost inequalities be valid for 
all individual arrays in the subtree rooted at s.root. Instead, only the sums of 
MemUsage and DiskCost over the entire subtree need to be compared. 

The result of this approach is a set of candidate solutions that are 
characterized by pairs of the form (MemUsage(s), DiskCost(s)). The algorithm described 
above prunes away all solutions that have higher MemUsage and DiskCost 
under the tile size constraint T = 1. For each candidate solution in the set, we then 
search for the tile sizes that minimize the disk access cost [6]. Increasing the tile 
sizes causes the disk access cost to decrease and the memory usage to increase, 
since array dimensions that have been eliminated by fusion get expanded to tile 
size. Finally, we select the solution with the least disk access cost. 
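
Once every solution is summarized by a single (MemUsage, DiskCost) pair, pruning amounts to keeping the solutions that no other solution beats on both counts. The sketch below illustrates that step only. It deliberately ignores the nesting condition of the Inferior rule, so it corresponds to pruning at the root of the tree; the function name prune and the candidate values are hypothetical.

```python
def prune(candidates):
    """Drop every candidate whose memory usage and disk cost are both
    higher than or equal to those of another candidate.

    Each candidate is a (name, mem_usage, disk_cost) tuple.
    """
    kept = []
    # Visit candidates by increasing memory usage (ties: increasing disk cost).
    for name, mem, disk in sorted(candidates, key=lambda c: (c[1], c[2])):
        # Everything already kept uses no more memory, so the new candidate
        # survives only if it strictly lowers the best disk cost seen so far.
        if not kept or disk < kept[-1][2]:
            kept.append((name, mem, disk))
    return kept


if __name__ == "__main__":
    candidates = [            # hypothetical values with T = 1
        ("s1", 10, 500),
        ("s2", 20, 300),
        ("s3", 20, 400),      # same memory as s2 but more disk traffic
        ("s4", 30, 300),      # same disk traffic as s2 but more memory
        ("s5", 5, 900),
    ]
    for name, mem, disk in prune(candidates):
        print(name, mem, disk)
```

The surviving candidates trade memory for disk traffic; the subsequent tile size search then decides which of them performs best under the actual memory limit.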

6 Experimental Evaluations 

We used the algorithm from Sec. 5 to generate code for the AO-to-MO index 
transformation calculation described in Sec. 2. The algorithm generated 77 
candidate solutions that would then be run through the tiling algorithm. We present 
two representative solutions generated by this algorithm. 

The solution shown in Fig. 4 places only temporary T2 on disk, while the 
solution shown in Fig. 5 places only the temporary T1 on disk. After tile size 
search, the tiled code with the least disk access cost was the one based on the 
solution with T2 on disk. The optimal code is shown in Fig. 6. 

Measurements were taken on a Pentium II system with the configuration 
shown in Table 1. The codes were all compiled with the Intel Fortran Compiler 



Table 1. Configuration of the system whose I/O characteristics were studied 

Processor            OS               Compiler           Memory   Hard disk 
Pentium II 300 MHz   Linux 2.4.18-3   gcc version 2.96   128MB    Maxtor 6L080J4 
  FOR q, r, s 
    FOR a 
      T1[a] = 0.0 
    FOR p 
      A = Read A[p,q,r,s] 
      FOR a 
        C4 = Read C4[p,a] 
        T1[a] += A * C4 
    Write T1[a] to T1[a,q,r,s] 

  FOR b 
    C3[q] = Read C3[q,b] 
    FOR a 
      FOR s, c 
        T3[s,c] = 0.0 
      FOR r 
        C2[c] = Read C2[r,c] 
        FOR s 
          T2 = 0.0 
          FOR q 
            T1 = Read T1[a,q,r,s] 
            T2 += T1 * C3[q] 
          FOR c 
            T3[s,c] += T2 * C2[c] 
      FOR d 
        FOR c 
          B[c] = 0.0 
        FOR s 
          C1 = Read C1[s,d] 
          FOR c 
            B[c] += T3[s,c] * C1 
        Write B[c] to B[a,b,c,d] 

Fig. 5. Fused structure with temporary T1 on disk 



  FOR rT, sT 
    FOR aT, qT, rI, sI, aI, qI 
      T1[aT+aI,qT+qI,rI,sI] = 0.0 
    FOR pT 
      C4[pI,aT+aI] = Read C4[p,a] 
      FOR qT 
        A[pI,qI,rI,sI] = Read A[p,q,r,s] 
        FOR aT, rI, sI, pI, qI, aI 
          T1[aT+aI,qT+qI,rI,sI] += A[pI,qI,rI,sI] * C4[pI,aT+aI] 
    FOR bT 
      FOR aT, rI, sI, bI, aI 
        T2[aT+aI,bI,rI,sI] = 0.0 
      FOR qT 
        C3[qI,bI] = Read C3[q,b] 
        FOR aT, rI, sI, bI, qI, aI 
          T2[aT+aI,bI,rI,sI] += T1[aT+aI,qT+qI,rI,sI] * C3[qI,bI] 
      Write T2[aT+aI,bI,rI,sI] to T2[a,b,r,s] 

  FOR aT, bT, cT 
    FOR sT, aI, bI, cI, sI 
      T3[aI,bI,cI,sT+sI] = 0.0 
    FOR rT 
      C2[rI,cI] = Read C2[r,c] 
      FOR sT 
        T2[aI,bI,rI,sI] = Read T2[a,b,r,s] 
        FOR aI, bI, cI, rI, sI 
          T3[aI,bI,cI,sT+sI] += T2[aI,bI,rI,sI] * C2[rI,cI] 
    FOR dT 
      FOR aI, bI, cI, dI 
        B[aI,bI,cI,dI] = 0.0 
      FOR sT 
        C1[dI,sI] = Read C1[d,s] 
        FOR aI, bI, cI, dI, sI 
          B[aI,bI,cI,dI] += T3[aI,bI,cI,sT+sI] * C1[dI,sI] 
      Write B[aI,bI,cI,dI] to B[a,b,c,d] 

Fig. 6. Loop structure after tiling 



for Linux. Although this machine is now very old and much slower than PCs 
available today, it was convenient to use for our experiments in an uninterrupted 
mode, with no interference to the I/O subsystem from any other users. 

Table 2 shows the measured I/O time for the AO-to-MO transform where the 
sizes of the tensors (double precision) considered were: N_p = N_q = N_r = N_s = 80 
and N_a = N_b = N_c = N_d = 70. We used 100 MB as the memory limit. The I/O 
time for each array was separately accumulated. The predicted values match 
quite well with the measured time. The match is better for the overall I/O time 
than for some individual arrays. This is because disk writes are asynchronous 
and may be overlapped with succeeding disk reads; hence the measurements 
of I/O time attributable to individual arrays are subject to error due to such 
overlap, but the total time should not be affected by the interleaving of writes 
with succeeding reads. For these tensor sizes and an available memory of 100MB, 
it is possible to choose fusion configurations so that the sizes of any two out of 
the three intermediate arrays can be reduced to fit completely in memory, but 
it is impossible to find a fusion configuration that fits all three intermediates 
within memory. Thus, it is necessary to keep at least one of them on disk, and 
incur disk I/O cost for that array. 
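
For orientation, the unfused sizes of the five logical arrays at these ranges follow from simple arithmetic. The sketch below computes them, assuming 8-byte double-precision words and decimal megabytes (the text does not say which megabyte convention is meant); the helper name array_bytes is hypothetical.

```python
MB = 10**6   # assumption: decimal megabytes


def array_bytes(*ranges, word=8):
    """Size in bytes of a dense array with the given index ranges."""
    size = word
    for n in ranges:
        size *= n
    return size


N, V = 80, 70   # N_p = N_q = N_r = N_s = 80 and N_a = N_b = N_c = N_d = 70
sizes = {
    "A":  array_bytes(N, N, N, N),   # A(p,q,r,s)
    "T1": array_bytes(V, N, N, N),   # T1(a,q,r,s)
    "T2": array_bytes(V, V, N, N),   # T2(a,b,r,s)
    "T3": array_bytes(V, V, V, N),   # T3(a,b,c,s)
    "B":  array_bytes(V, V, V, V),   # B(a,b,c,d)
}

if __name__ == "__main__":
    for name, nbytes in sizes.items():
        print(f"{name}: {nbytes / MB:.1f} MB")
```

Under these assumptions every array, held in full, exceeds the 100 MB limit on its own, so intermediates fit in memory only after fusion has removed some of their dimensions.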

Table 3 shows the predicted I/O times and the improvement factor of the 
integrated fusion+tiling algorithm over the decoupled algorithm for the AO-to- 
MO transformation example for different array sizes and memory limits. For 
the array sizes a = 70 and p = 80, actual measurements were performed using 
the 100 MB, 500 MB, and 2000 MB memory limits and, in all cases, for the 
integrated algorithm, the predicted results matched the actual results. For the 
memory limits of 500 MB and 2000 MB and the small array sizes, both the 
decoupled and the integrated algorithm were able to fit all the temporaries in 
memory, and thus no significant improvement was achieved. 

Depending on the size of the problem, as the memory pressure increases, the 
improvement factor of the integrated algorithm over the decoupled algorithm 
increases significantly. This is to be expected, because the decoupled algorithm 
introduces more redundant reads and writes than the integrated algorithm. With 
high memory pressure, the tiles cannot be made very large, which results in an 
insufficient reduction of the redundant disk accesses. 

The measured results and the predicted results match well and the integrated 
fusion+tiling algorithm outperforms the decoupled datalocality algorithm. 

7 Conclusion 

We have described an optimization approach for synthesizing efficient out-of-core 
algorithms in the context of the Tensor Contraction Engine. We have presented 
two algorithms for performing an integrated fusion and tiling search. Our algo- 
rithms produce a set of candidate solutions, each with a fused loop structure 
and read and write operations for temporaries. After determining the tile sizes 
that minimize the disk access cost, the optimal solution is chosen. We have 
demonstrated with experimental results that the integrated approach 
outperforms a decoupled approach of first determining the fused loop structure and 
then searching for the optimal tile sizes. 



Table 2. Predicted and Measured I/O Time: a solution generated by the new 
fusion-datalocality algorithm for the AO-to-MO transform example 

                      Predicted Results (seconds)   Measured Results (seconds) 
Array A               21.3                          31 
Array B               18.25                         14 
Array T2              40.14                         41 
Arrays C1,C2,C3,C4    0.052                         0.72 
Total time            79.74                         86.7 
Table 3. Comparison of predicted I/O time for the AO-to-MO transform example 

Memory limit   Ranges          Decoupled          Integrated         Improvement factor 
100MB          a=70,  p=80     1.882 x 10^2 sec   0.747 x 10^2 sec    2.52 
               a=200, p=300    7.390 x 10^4 sec   0.850 x 10^4 sec    8.70 
               a=500, p=600    5.520 x 10^6 sec   2.300 x 10^5 sec   24.00 
500MB          a=70,  p=80     0.395 x 10^2 sec   0.395 x 10^2 sec    1.00 
               a=200, p=300    4.830 x 10^4 sec   0.850 x 10^4 sec    5.70 
               a=500, p=600    3.560 x 10^6 sec   2.300 x 10^5 sec   15.50 
2000MB         a=70,  p=80     0.395 x 10^2 sec   0.395 x 10^2 sec    1.00 
               a=200, p=300    3.780 x 10^4 sec   0.850 x 10^4 sec    4.45 
               a=500, p=600    2.140 x 10^6 sec   2.109 x 10^5 sec   10.14 
Abstract. This paper presents a programming model, an interface definition 
language (P-COM2) and a compiler that composes parallel and distributed 
programs from independently written components. P-COM2 specifications 
incorporate information on behaviors and implementations of components to 
enable qualification of components for effectiveness in specific application 
instances and execution environments. The programming model targets 
development of families of related programs. One objective is to be able to 
compose programs which are near-optimal for given application instances and 
execution environments. Component-oriented development is motivated for 
parallel and distributed computations. The programming model is defined and 
described and illustrated with a simple example. The compilation process is 
briefly defined and described. Experience with one more complex application, 
a generalized fast multipole solver, is sketched, including performance data, 
some of which was surprising. 



1 Introduction 

This paper presents a language (P-COM2) 1 and a compiler that composes parallel and 
distributed programs from independently written components and illustrates their 
application. P-COM2 is an interface definition language which incorporates 
information on behaviors and implementations of components to enable qualification 
of components for effectiveness in specific application instances and execution 
environments. The general strategy is somewhat similar to composition of programs 
in the Web Services paradigm but the goals are quite different. A component is a 
serial program which is encapsulated by an associative interface [8,11] which 
specifies the properties of the component. The composition implemented by the 
compiler is based on matching of associative interfaces and generates as final output 
either an MPI program or multi-threaded code for a shared memory multi-processor. 
The CODE [26] parallel programming system is used as an intermediate language and 
is the immediate target language of the compositional compiler. 

Component-oriented software development is one of the most active and 
significant threads of research in software engineering [1,10,15,29]. There are many 
motivations for raising the level of abstraction of program composition from 
individual statements to components with substantial semantics. It is often the case 



1 P-COM2 stands for Parallel COMposition from COMponents. 
that there is a family of applications which can be generated from a modest number of 
appropriately-defined components. Optimization and adaptation for different 
execution environments is readily accomplished by creating and maintaining multiple 
versions of components rather than by direct modifications of complete applications. 
Programs generated and maintained as compositions of components are much more 
understandable and thus much more readily modifiable and maintainable. 

Even though there are additional benefits to component-oriented development in 
the distributed and parallel domain 2 , there has been relatively little research on 
component based programming in the context of high performance parallel and 
distributed programming. (Section 8 summarizes related work.) The execution 
environments for parallel programs are much more diverse than those for sequential 
programs. It is often necessary to maintain multiple versions of parallel programs for 
different execution environments. Program development by composition of 
components enables adaptation of parallel programs to different execution 
environments and optimization for different application instances by replacement of 
components. Adaptive control of parallel and distributed programs [3] is also enabled 
by replacement of components. Management of adaptations such as degree of
parallelism and load balancing is readily accomplished at the component level.
Parallelism is most often determined by the number of instances of a component
which are executing in parallel (SPMD parallelism). The P-COM 2 language and the
compiler explicitly make provision for dynamic SPMD parallelism. It has also been
found that viewing programs as compositions of components tends to lead to 
programs with better structuring and better performance even for sequential versions. 

P-COM 2 approaches component-oriented development of parallel and distributed 
programs from a different perspective than most other projects. The principal 
concerns and goals for the P-COM 2 project have been to enable automation or at least 
partial automation of composition through a compiler, to develop a mechanism 
enabling runtime adaptation of parallel and distributed programs at the component 
level [3] and to enable performance-oriented, evolutionary development of parallel 
and distributed programs. This paper covers the first topic, compiler-implemented 
composition. The P-COM 2 interface definition language incorporates information on 
component properties and behaviors as well as function/procedure/method interfaces 
including an implicit state machine to sequence invocations of components with 
internal state. Additionally the P-COM 2 system targets development of families of 
programs with instances of the family targeting given application instances or given 
execution environments. 

The P-COM 2 language and compiler have been used in implementing some 
substantial programs. One of the applications is to construct components and 
compose programs for solving linear equations using a fast multipole solver (FMM). 
The FMM code can be formulated in either a memory intensive or computation 
intensive formulation and at points in between. It is complex to write a parameterized 
program spanning these options but they are readily composed from parameterized 
components. The compiler has also been applied in the composition of parallel
method of lines (MOL) codes for solving time dependent partial differential
equations. MOL also has a great number of possible configurations and runtime
adaptations.

2 CORBA, Web Services, etc., which are very much component-oriented development
systems, are not commonly used for development of parallel or high performance
applications.

The remainder of the paper is organized as follows. Section 2 explains
some terms and concepts used in the compiler. The programming model, the
language and the compilation process are described in Sections 3, 4, and 5
respectively. A simple program, a macro-parallel FFT algorithm [32], is then used in
Section 6 to illustrate the programming model, the programming language (which is an
interface definition language) and the compilation process. The components, the
compilation process and a short discussion of the FMM code are given in Section 7.
Section 8 discusses related work in this area. Section 9 concludes the paper and
discusses some future directions.



2 Definition of Terms and Concepts 

Domain Analysis: Domain analysis [5] identifies the components from which a 
family of programs in the domain can be constructed and identifies a set of attributes 
in which the properties and behaviors of the components can be defined. It is usually 
the case that applications require components from multiple domains. 

Component: A component is one or more sequential computations, an interface 
which specifies the information used for selection and matching of components and a 
state machine which manages the interface, the interactions with other peers and the 
invocation of the sequential computations. An interaction, which may be initiated as 
an incoming message (or set of messages) or as an invocation of a transaction, will 
trigger an action which is associated with some state of the state machine. The action 
may include execution of a sequential computation. 

Sequential Computation: A computation is a unit of work that implements some 
atomic functionality. A computation is a sequential program which refers only to its 
own local variables and its input variables. 

Associative Interface: An associative interface [8] encapsulates a component. It 
describes the behavior and functionality of a component. One of the most important 
properties of associative interfaces is that they enable differentiation among 
alternative implementations of the same component. These interfaces are called 
"associative" because selection and matching is similar to operations on content- 
addressable memories. An associative interface consists of an accepts specification 
and a requests specification. 

Accepts Specification: An accepts interface specifies the set of interactions in which 
a component is willing to participate. The accepts interface for a component is a set of 
three-tuples (profile, transaction, protocol).

• A profile is a set of attribute/value pairs. Components have a priori agreement on 
the set of attributes and values which can appear on the accepts and requests 
interface of a component. 

• A transaction specification incorporates one or more function signatures
including the data types, functionality and parameters of the unit of work to be 
executed and a state machine which manages the order of execution of the units 
of work. The state machine is defined in the form of conditional expressions over 
states and function signatures. A transaction can be enabled or disabled based on 
its current state and its current state can be used in runtime binding of the 
components. Multiple transactions controlled by the state machine can be used to 
represent complex interactions such as precedence of transactions, "and" 
relationships among transactions acting as a barrier and "or" relationships 
between transactions representing alternative ways of executing the component. 

• A protocol defines a sequence of simple interactions necessary to complete the 
interaction specified by the profile. The most basic protocol is data-flow 
(continuations), which is defined as executing the functionality of a component 
and transmitting the output to a successor defined by the selectors at that 
component without returning to the invoking component. More complex 
interaction protocols such as call-return and persistent transactions are planned 
but not yet implemented. 

Requests Specification: A requests interface specifies the set of interactions which a 
component must initiate if it is to complete the interactions it has agreed to accept. 
The requests interface is a set of three-tuples (selector, transaction, protocol). A
component can have multiple tuples in its requests interface to implement its required 
functionality. 

• A selector is a conditional expression over the attributes of all the components in 
the domain. 

• Transaction specifications are similar to those for accepts specifications. 

• Protocol specifications are as given for accepts specifications. 

Start Component: A start component is a component that has at least one requests 
interface and no accepts interface. Every program requires a start component. There 
can be only one start component in a program which provides a starting point for the 
program. 

Stop Component: A stop component is a component that has at least one accepts 
interface and no requests interface. A stop component is also a requirement for 
termination of a program. There can be more than one stop component of a program 
denoting multiple ending points for the program. 
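
The definitions above can be summarized as a small data model. The following Python sketch is an illustration only and is not the P-COM 2 implementation: the class names (Transaction, Accepts, Requests, Component), the representation of a selector as a Python predicate over a profile, and the reduction of a transaction to a name plus argument types (without its state machine) are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A profile is a set of attribute/value pairs.
Profile = Dict[str, object]
# Assumption: a selector is modelled as a predicate over a profile.
Selector = Callable[[Profile], bool]


@dataclass(frozen=True)
class Transaction:
    """Function signature only; the state machine is omitted in this sketch."""
    name: str
    arg_types: Tuple[str, ...]


@dataclass
class Accepts:
    """One (profile, transaction, protocol) tuple of an accepts interface."""
    profile: Profile
    transaction: Transaction
    protocol: str = "dataflow"


@dataclass
class Requests:
    """One (selector, transaction, protocol) tuple of a requests interface."""
    selector: Selector
    transaction: Transaction
    protocol: str = "dataflow"


@dataclass
class Component:
    name: str
    accepts: List[Accepts] = field(default_factory=list)
    requests: List[Requests] = field(default_factory=list)

    def is_start(self) -> bool:
        # At least one requests tuple and no accepts tuple.
        return bool(self.requests) and not self.accepts

    def is_stop(self) -> bool:
        # At least one accepts tuple and no requests tuple.
        return bool(self.accepts) and not self.requests


get_matrix = Transaction("get_matrix", ("mat2", "mat2", "int", "int", "int"))
show = Transaction("get_grid_n_m", ("mat2", "mat2", "int", "int"))

initialize = Component(
    "initialize",
    requests=[Requests(
        lambda prof: prof.get("domain") == "matrix"
        and prof.get("function") == "distribute",
        get_matrix)],
)
distribute = Component(
    "distribute",
    accepts=[Accepts({"domain": "matrix", "function": "distribute"}, get_matrix)],
    requests=[Requests(lambda prof: prof.get("domain") == "fft", show)],
)
printer = Component(
    "print",
    accepts=[Accepts({"domain": "print", "input": "matrix"}, show)],
)
```

With these definitions, initialize plays the role of a start component, printer that of a stop component, and distribute is neither, since it both accepts and requests interactions.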



3 Programming Model 

The domain-based, component-oriented programming model targets development of 
a family of programs rather than a single program. The programming model has two 
phases: development of families of components and specification of instances from 
the family of programs which can be instantiated from the sets of components. 

3.1 Component Development

The set of components which enables construction of a family of application 
programs may include components which utilize different algorithms for different 
problem instances or different implementation strategies for different execution 
environments. A program for a given problem instance or given execution 
environment is composed from appropriate components by selecting desired 
properties for the components and the properties of the execution environment in the 
Start component. The steps for developing components are: 

a. Domain Analysis - Execute the necessary domain analyses. It is usually the 
case that applications require components from multiple domains. 

b. Component Development - Specify the family of components identified in the
domain analysis, and either design and implement them in an appropriate
sequential procedural language or discover them in existing libraries.

c. Encapsulate - Encapsulate the components in the P-COM 2 interface definition 
language using the attributes identified in the domain analysis to specify 
associative interfaces for the components. The interfaces must differentiate the 
components by identifying their properties in terms of the attributes defined in 
the domain analysis. 

3.2 Program Instance Development 

The steps in specifying a given instance of an application are: 

a. Analyze the problem instance and the target execution environment. Identify 
the attributes and attribute values which characterize the components desired 
for this problem instance and execution environment. 

b. Identify the components from which the application instance will be 
composed. If the needed components are not available then some additional 
implementations of components may be necessary together with an extension 
of the domain analysis. 

c. Identify the dependence graph of the application instance. The dependence 
graph is expressed in terms of the components identified. Specify the number 
of replications desired for parallelism and for fault-tolerance. Incorporate 
these specifications into the component interfaces or as parameters in the Start 
component if parameterized parallelism has been incorporated into the 
component interfaces. 

d. Define a Start component which initializes the replication parameters and sets
the attribute values needed to ensure that the desired components are selected and
matched.

e. Define at least one Stop component. 



4 The Interface Definition Language - P-COM 2

The fundamental concepts underlying the interface definition language were given in 
Section 2. The syntax will be illustrated in the example in Section 6. Here we 
discuss what is expressed in the interfaces specifiable in the language. 

The language is rooted in the domain analyses for the program family. The
domain analyses specify problem domain knowledge. It is expected that an 
application developer should be able, once familiar with the concepts of domain 
analysis, to generate domain analyses for a family of codes in her/his area of 
expertise. The associative interfaces define the behaviors of the components and will 
usually give properties of a given component's implementation of its functionality. 
Properties of desired implementations such as degree of parallelism for a given 
component are also specified in the associative interface as runtime determined 
parameters. It is often desirable for a component to retain state across executions. 
There may be precedence or sequencing relations among the transactions 
implemented by a component. Precedence and sequencing information is also 
specified in the interface as an implicit state machine implemented as a conditional 
expression over the states of the components and the transaction specifications. 
Finally, the protocol specification enables choice among interaction modes (although
only one is currently implemented).



5 Compilation Process 

The conditional expression of a selector is a template which has slots for attribute 
names and values. The names and values are specified in the profiles of other 
components of the domain. Each attribute name in the selector expression of a 
component behaves as a variable. The attribute variables in a selector are instantiated 
with the values defined in the profile of another component. The profile and the 
selector are said to match when the instantiated conditional expression evaluates to 
true. 
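
The matching rule described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the compiler's code: a selector is assumed to be a conjunction of (attribute, operator, value) clauses, as in the selectors shown in Section 6, and the function name matches and the OPS table are introduced here purely for illustration.

```python
import operator

# Assumption: only equality and inequality tests appear in selectors.
OPS = {"==": operator.eq, "!=": operator.ne}


def matches(selector, profile):
    """Instantiate each attribute variable of the selector with the value
    found in the profile and evaluate the resulting conjunction."""
    for attribute, op, expected in selector:
        if attribute not in profile:
            return False  # the slot cannot be instantiated from this profile
        if not OPS[op](profile[attribute], expected):
            return False
    return True


# Selector of the initialize component's requests interface (Fig. 3).
REQUEST_SELECTOR = [
    ("domain", "==", "matrix"),
    ("function", "==", "distribute"),
    ("element_type", "==", "complex"),
    ("distribute_by_row", "==", True),
]

# Profile of the distribute component's accepts interface (Fig. 4a).
DISTRIBUTE_PROFILE = {
    "domain": "matrix",
    "function": "distribute",
    "element_type": "complex",
    "distribute_by_row": True,
}

# Profile of the fft_row component's accepts interface (Fig. 5a).
FFT_ROW_PROFILE = {
    "domain": "fft",
    "input": "matrix",
    "element_type": "complex",
    "algorithm": "Cooley-Tukey",
    "apply_per_row": True,
}
```

Here matches(REQUEST_SELECTOR, DISTRIBUTE_PROFILE) evaluates to true, while the same selector does not match FFT_ROW_PROFILE because the domain attribute differs.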

The source program for the compilation process is a start component with a 
sequential computation which implements initialization for the program and a 
requests interface which specifies the components implementing the first steps of the 
computation and one or more libraries to search for components. The libraries should 
include the components needed to compose a family of applications specified by a 
domain analysis. The set of components which is composed to form a program is 
primarily dependent on the requests interface of the start component. 

The target language for the compilation process is a generalized data flow graph as 
defined in [26]. A node in this data flow graph consists of an initialization, a firing 
rule, a sequential computation and a routing rule for distribution of the outputs of the 
computation. There are two special node types, a start node and a stop node. 
Acceptable data flow graphs must begin with a start node and terminate on a stop 
node. 

The compilation process starts by parsing the associative interface of the start 
component. The compiler then searches a specified list of libraries for components
whose accepts interface matches with the requests interface of the start component.
The matching process is actually not much more than a sophisticated type matching. 
If the matching between the selector of one component and the profile of another 
component is successful, the compiler tries to match the corresponding transactions of 
the requests and accepts interfaces. The transactions are said to match when all of the
following conditions are true. 1) The names of the two transactions are the same. 2) The
number of arguments of each of the two transactions is the same. 3) The data type of
each argument in the requests transaction is the same as that of the corresponding 
argument in the accepts transaction. 4) The sequencing constraint given by the 
conditional expression in the accepts transaction specification (the state machine) is 
satisfied. Finally the protocol specifications must be consistent. 
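
The four conditions and the protocol check can be written down directly. The Python sketch below is hypothetical: the Transaction class, the guard field used to model the sequencing constraint, and the dictionary used for the accepting component's state are assumptions made for illustration and do not reflect the internal data structures of the compiler.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Transaction:
    name: str
    arg_types: Tuple[str, ...]
    # Assumption: the sequencing constraint is a predicate over the
    # accepting component's state; None means "always enabled".
    guard: Optional[Callable[[dict], bool]] = None


def transactions_match(request, accept, accept_state,
                       request_protocol="dataflow",
                       accept_protocol="dataflow"):
    # 1) Same transaction name.
    if request.name != accept.name:
        return False
    # 2) Same number of arguments.
    if len(request.arg_types) != len(accept.arg_types):
        return False
    # 3) Same data type for each corresponding argument.
    if any(r != a for r, a in zip(request.arg_types, accept.arg_types)):
        return False
    # 4) The sequencing constraint of the accepts transaction is satisfied.
    if accept.guard is not None and not accept.guard(accept_state):
        return False
    # Finally, the protocol specifications must be consistent.
    return request_protocol == accept_protocol


req_matrix = Transaction("get_matrix", ("mat2", "mat2", "int", "int", "int"))
acc_matrix = Transaction("get_matrix", ("mat2", "mat2", "int", "int", "int"))

# A transaction that is only enabled after a preceding one has completed.
req_inst = Transaction("get_grid_n_m_inst", ("mat2", "mat2", "int"))
acc_inst = Transaction("get_grid_n_m_inst", ("mat2", "mat2", "int"),
                       guard=lambda state: state.get("get_p_done", False))
```

In this sketch the request for get_grid_n_m_inst matches only when the accepting component's state records that the preceding transaction has completed.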

When compilation of the start component is completed, it is converted into a start
node [26] for the data flow graph which will represent the parallel program. Each
match of a requests interface to an accepts interface results in the addition of a node to
the data flow graph, which is being incrementally constructed by the compilation
process, and of an arc connecting this new node to the node which is currently being
processed by the compiler. If there is a replication clause in a transaction
specification then at runtime the specified number of replicas of the matched
component are instantiated and linked with data flow arcs. This searching and
matching process for the requests interface is applied recursively to each of the
components that are in the matched set. The composition process stops when no more
matching of interfaces is possible, which will always occur with a Stop component
since a Stop component has no requests interface. Compilation of a P-COM 2 stop
component results in generation of a stop node for the data flow graph. The compiler 
will signal an error if a requests interface cannot be matched with an accepts interface 
of a desired component. The data flow graph which has been generated is then 
compiled to a parallel program for a specific architecture by compilation processes 
implemented in the CODE [26] parallel programming system. 
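
The composition process can be pictured as a worklist traversal that starts at the start component and follows matched requests. The Python sketch below is a simplified illustration and not the P-COM 2 compiler: components are plain dictionaries, selectors are reduced to attribute/value equality, transaction and protocol matching are omitted, and the set of already visited nodes (needed because gather_transpose requests distribute again) is an assumption of this sketch.

```python
def compose(start, library):
    """Return (nodes, arcs) of the data flow graph reachable from start."""
    nodes = [start["name"]]
    arcs = []
    work = [start]
    while work:
        comp = work.pop()
        for selector in comp["requests"]:
            found = [c for c in library
                     if all(c["profile"].get(k) == v
                            for k, v in selector.items())]
            if not found:
                # A requests interface that cannot be matched is an error.
                raise LookupError(
                    "no component matches %r requested by %s"
                    % (selector, comp["name"]))
            for target in found:
                arcs.append((comp["name"], target["name"]))
                if target["name"] not in nodes:
                    nodes.append(target["name"])
                    work.append(target)
    return nodes, arcs


GATHER = {"domain": "matrix", "function": "gather"}
DISTRIBUTE = {"domain": "matrix", "function": "distribute"}

library = [
    {"name": "distribute", "profile": DISTRIBUTE,
     "requests": [GATHER, {"domain": "fft"}]},
    {"name": "fft_row", "profile": {"domain": "fft"},
     "requests": [GATHER]},
    {"name": "gather_transpose", "profile": GATHER,
     "requests": [DISTRIBUTE, {"domain": "print"}]},
    {"name": "print", "profile": {"domain": "print"}, "requests": []},
]
start = {"name": "initialize", "profile": {}, "requests": [DISTRIBUTE]}

nodes, arcs = compose(start, library)
```

For the FFT example of Section 6 this yields five nodes and six arcs, and the traversal ends at print, which has no requests interface.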

6 Example Program 

This section presents an example program showing the complete process of 
developing a parallel program for the fast Fourier transformation (FFT) of a matrix in 
two dimensions from simple components. The algorithm presented is an adaptation 
of Swarztrauber's multiprocessor FFT algorithm [32]. This problem is simple enough 
to cover in detail and illustrates many of the important concepts such as stateful 
components and precedence constraints. Given an N x M matrix of complex numbers 
where both N and M are powers of 2, we want to compute the 2D FFT of the 
complex matrix. This 2D FFT can be described in terms of 1D FFTs, which helps in
parallelizing the algorithm. Let us assume that there are P available processors where
P is also a power of 2. In this case the domain analysis is straightforward and is an
analysis of the algorithm itself. The steps of the algorithm are as follows:

a) Partitioning the matrix row wise (horizontally) into P submatrices, one for 
each processor. 

b) Sending these submatrices to each of the P processors for computation. The
size of each submatrix is N/P x M.

c) Each processor performs a 1D FFT on every row of the submatrix that it
received.

d) Collecting these 1D FFTs and then transposing the N x M matrix. The
resulting matrix is of size M x N.

e) Splitting the M x N matrix row wise into P submatrices. The size of each
submatrix is M/P x N.

f) Sending these submatrices to each of the P processors for computation.

g) Again each processor performs a 1D FFT on every row of the submatrix
that it received.

h) Collecting all the submatrices from the P processors and transposing the M x 
N matrix to get an N x M matrix. The resulting N x M matrix is the 2D FFT 
of the original matrix. 
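
Steps a) through h) can be checked with a short, purely sequential Python sketch. It is an illustration under stated assumptions rather than the code generated by the compiler: the P "processors" are simulated by a loop, the function names mirror the component names used later in this section, the 1D FFT is a textbook recursive radix-2 Cooley-Tukey routine, and dft2d is a direct reference transform included only for comparison.

```python
import cmath


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of 2."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def distribute(matrix, p):
    """Steps a/b and e/f: split the rows into p equal submatrices."""
    rows = len(matrix) // p
    return [matrix[i * rows:(i + 1) * rows] for i in range(p)]


def fft_row(block):
    """Steps c and g: 1D FFT of every row of a submatrix."""
    return [fft(row) for row in block]


def gather_transpose(blocks):
    """Steps d and h: merge the submatrices and transpose the result."""
    merged = [row for block in blocks for row in block]
    return [list(col) for col in zip(*merged)]


def fft2d(matrix, p):
    first = gather_transpose([fft_row(b) for b in distribute(matrix, p)])
    return gather_transpose([fft_row(b) for b in distribute(first, p)])


def dft2d(matrix):
    """Direct 2D DFT, used only as a reference for small inputs."""
    n, m = len(matrix), len(matrix[0])
    return [[sum(matrix[a][b]
                 * cmath.exp(-2j * cmath.pi * (u * a / n + v * b / m))
                 for a in range(n) for b in range(m))
             for v in range(m)]
            for u in range(n)]


x = [[complex(r * 8 + c, (r - c) % 3) for c in range(8)] for r in range(4)]
result = fft2d(x, 2)      # N = 4, M = 8, P = 2
reference = dft2d(x)
max_error = max(abs(result[u][v] - reference[u][v])
                for u in range(4) for v in range(8))
```

The sketch assumes that P divides both N and M, which holds when all three are powers of 2 and P does not exceed either dimension.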

This simple analysis suggests that all of the instances of this algorithm can be
created by composing instances of three components: a one-dimensional FFT
component, a component which partitions and distributes matrices and a component
which merges rows or columns to recover a matrix and which may optionally
transpose the recovered matrix. Let us name these components fft_row, distribute,
and gather_transpose respectively. One could as well formulate the algorithm with
separate components for merge and transpose but that could introduce additional
communication. The algorithm can also use any 1D FFT algorithm to calculate the 2D
FFT of the matrix. Additionally the choice of implementation for transposition of an
array may vary with execution environment. Note that each of the components above
is reused, as each of them is used twice in the algorithm. These
components could reasonably be expected to be available as "off the shelf" components
in linear algebra and FFT libraries. In addition to the
above three components we need a component that will read/initialize the matrix and
one component to print out the final result. Let us name these components initialize
and print. The component to read/initialize the array may be the Start component and
the print component may be the Stop component. The Start component will be
written to specify the set of component instances which will be composed for a given
data set and target execution environment.




Fig. 1. Data Flow Graph of 2D FFT Computation



The dependence graph of the program in terms of these components is shown in
Fig. 1. This data flow graph suggests an optimization of creating a new component
which combines the functions of distribute and gather_transpose. This, depending on
the mapping of nodes to processors, could eliminate two transmissions of the large
matrix. As shown in Figure 1, parallelism can be achieved through the use of
multiple fft_row components. Note that the gather_transpose component has to keep
track of its state as it sends data to the distribute component after its first execution and
to the print component after its second execution.

Once we have identified the components, the next step is to complete the domain 
analysis by defining a list of attributes through which we can describe the functions, 
behaviors and implementations of a component and their instantiations. When some 
service is required it is described in terms of the attributes in the format of accepts 
and requests interfaces. 

The two domains from which this computation is composed are the matrix and fft 
domains. There is a generic attribute "Domain" which is required for multi-domain 
problems. The matrix domain has these distinct attributes: 

a) Function: an attribute of type string. Describes its function. 

b) Element_type: an attribute of type string. Describes the type information of
the input matrix.

c) Distribute_by_row: an attribute of type boolean. Describes whether the
component partitions the matrix by row or by column.

The fft domain has these attributes:

a) Input: an attribute of type string. Describes the input structure.

b) Element_type: an attribute of type string. Describes the type information of
the input. 

c) Algorithm: an attribute of type string. 

d) Apply_per_row: an attribute of type boolean. Describes whether to apply the 
FFT function per row or per column. 

The completed domain analysis for the components is shown in Figure 2. Once the 
domain analysis is done, we encapsulate the components in associative interfaces 
using the attributes and transactions. 

As shown in Figure 3, the requests interface of the initialize component specifies 
that it needs a component that can distribute a matrix row-wise. The interface passes 
real and imaginary parts of the matrix, the dimension of the matrix and the total 
number of processors to the distribute component using the transaction specification. 
The data type mat2 is defined as a two dimensional array data type. 



Fft_row

a) Domain: fft
b) Input: matrix
c) Element_type: complex
d) Algorithm: 1d-fft
e) Apply_per_row: true

Gather_transpose

a) Domain: matrix
b) Function: gather
c) Element_type: complex
d) Combine_by_row: true
e) Transpose: true

Distribute

a) Domain: matrix
b) Function: distribute
c) Element_type: complex
d) Distribute_by_row: true

Print

a) Domain: print
b) Input: matrix
c) Element_type: complex



Fig. 2. Domain Analysis of the Components 

Fig. 4a shows the accepts interface of the distribute component. This distribute
component assumes that the matrix which it partitions and distributes will be merged.
This is specified in Figure 4b. The first selector interfaces to the gather_transpose
component, providing the size of each of the submatrices, the total number of
submatrices to collect at the gather_transpose component and also state information
which is needed in the gather_transpose component. The second selector in Figure 4b
specifies that it needs p instances of the fft_row component and distributes the
submatrices to each of the replicated components along with their size. The construct
"index [p]" is used to specify that multiple copies of the fft_row component are needed.
The construct "[]" with the transaction argument is used to transmit different data to
different copies of the component. For different transmission patterns, different
constructs may be used in the language of the interface. Note that the number of
instances of the fft_row component is determined at runtime. 

selector:
    string domain == "matrix";
    string function == "distribute";
    string element_type == "complex";
    bool distribute_by_row == true;
transaction:
    int get_matrix(out mat2 grid_re, out mat2 grid_im, out int n,
                   out int m, out int p);
protocol: dataflow;



Fig. 3. Requests Interface of Initialize Component 



profile:
    string domain = "matrix";
    string function = "distribute";
    string element_type = "complex";
    bool distribute_by_row = true;
transaction:
    int get_matrix(in mat2 grid_re, in mat2 grid_im, in int n,
                   in int m, in int p);
protocol: dataflow;



Fig. 4a. Accepts Interface of distribute component 



Fig. 5a specifies that this implementation of the fft_row component uses the
"Cooley-Tukey" algorithm [13]. The fft_row component requires no knowledge of how many
copies of itself are being used. From Fig. 5b, we can see that the instance number of
the fft_row component is passed to the gather_transpose component using the
variable "me".
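
The role of the instance number can be illustrated with a small Python sketch. It is hypothetical and does not correspond to the code generated by the compiler: threads stand in for the replicated fft_row components, the replica function only doubles its input in place of a real FFT, and the names replica and scatter_gather are introduced here for illustration. The point shown is that each replica needs only its own index, while the collecting side uses that index to place results that may arrive in any order.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def replica(me, block):
    """Stand-in for one fft_row instance; it returns its index with its output."""
    return me, [[2 * value for value in row] for row in block]


def scatter_gather(blocks):
    p = len(blocks)              # degree of replication, known only at run time
    slots = [None] * p
    with ThreadPoolExecutor(max_workers=p) as pool:
        futures = [pool.submit(replica, me, block)
                   for me, block in enumerate(blocks)]
        for future in as_completed(futures):   # completion order is arbitrary
            me, result = future.result()
            slots[me] = result                 # the index restores the order
    return [row for block in slots for row in block]


merged = scatter_gather([[[1, 2]], [[3, 4]], [[5, 6]]])
```

Regardless of which replica finishes first, merged contains the rows in their original order.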

Figure 6a illustrates the use of the ">" operator between the transactions to
describe the precedence relationship between the transactions. The second transaction
cannot execute until the first transaction is completed. The gather_transpose
component collects the submatrices one by one through the second transaction in the
interface. P-COM 2 incorporates precedence ordering operations sufficient to express
simple state machines for management of interactions among components.

As shown in Fig. 6b, the first requests interface of the gather_transpose
component is used to connect to the distribute component. The second interface
connects to the print component. The variable "state" is used to enable one of the
transactions based on the current state of the gather_transpose component.
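
The behavior described in the two paragraphs above amounts to a small state machine, which the following Python sketch tries to capture. It is an assumption-laden illustration and not generated code: the class GatherTranspose, the simplified signature get_p(p, state), and the returned (target, matrix) pair are hypothetical, and the real interface also carries the submatrix dimensions.

```python
class GatherTranspose:
    """Collects p submatrices, transposes, then routes by the state value."""

    def __init__(self):
        self.p = None
        self.state = None
        self.blocks = {}

    def get_p(self, p, state):
        # First transaction: must complete before any submatrix is accepted.
        self.p, self.state, self.blocks = p, state, {}

    def get_grid_n_m_inst(self, block, inst):
        # Second transaction: enabled only after get_p (the ">" ordering).
        if self.p is None:
            raise RuntimeError("get_p must complete before get_grid_n_m_inst")
        self.blocks[inst] = block
        if len(self.blocks) < self.p:      # guard "gathered == p" not yet true
            return None
        merged = [row for i in range(self.p) for row in self.blocks[i]]
        transposed = [list(col) for col in zip(*merged)]
        # state == 1 enables the request to distribute, state == 2 to print.
        target = "distribute" if self.state == 1 else "print"
        self.p = None                      # wait for the next get_p
        return target, transposed


g = GatherTranspose()
g.get_p(2, 1)
pending = g.get_grid_n_m_inst([[1, 2]], 1)     # arrives out of order
first_pass = g.get_grid_n_m_inst([[3, 4]], 0)
g.get_p(2, 2)
g.get_grid_n_m_inst([[5, 6]], 0)
second_pass = g.get_grid_n_m_inst([[7, 8]], 1)
```

In the first pass the transposed matrix is routed to distribute, and in the second pass it is routed to print.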

selector:
    string domain == "matrix";
    string function == "gather";
    string element_type == "complex";
    bool combine_by_row == true;
    bool transpose == true;
transaction:
    int get_p(out int n/p, out int m, out int p, out int state);
protocol: dataflow;

{ selector:
      string domain == "fft";
      string input == "matrix";
      string element_type == "complex";
      string algorithm == "Cooley-Tukey";
      bool apply_per_row == true;
  transaction:
      int get_grid_n_m(out mat2 out_grid_re[], out mat2 out_grid_im[],
                       out int n/p, out int m);
  protocol: dataflow;
} index [ p ]



Fig. 4b. Requests Interface of distribute component 



profile:
    string domain = "fft";
    string input = "matrix";
    string element_type = "complex";
    string algorithm = "Cooley-Tukey";
    bool apply_per_row = true;
transaction:
    int get_grid_n_m(in mat2 grid_re, in mat2 grid_im, in int n,
                     in int m);
protocol: dataflow;



Fig. 5a. Accepts Interface of Fft_row Component



selector:
    string domain == "matrix";
    string function == "gather";
    string element_type == "complex";
    bool combine_by_row == true;
    bool transpose == true;
transaction:
    int get_grid_n_m_inst(out mat2 out_grid_re, out mat2 out_grid_im,
                          out int me);
protocol: dataflow;



Fig. 5b. Requests Interface of Fft_row Component



120 Nasim Mahmood et al. 



profile:
    string domain = "matrix";
    string function = "gather";
    string element_type = "complex";
    bool combine_by_row = true;
    bool transpose = true;
transaction:
    int get_p(in int n, in int m, in int p, in int state);
    >
    int get_grid_n_m_inst(in mat2 grid_re, in mat2 grid_im,
                          in int inst);
protocol: dataflow;



Fig. 6a. Accepts Interface of Gather_transpose Component 



selector:
    string domain == "matrix";
    string function == "distribute";
    string element_type == "complex";
    bool distribute_by_row == true;
transaction:
    %{ state == 1, gathered == p }%
    int get_matrix(out mat2 out_grid_re, out mat2 out_grid_im,
                   out int m, out int n*p, out int p);
protocol: dataflow;

selector:
    string domain == "print";
    string input == "matrix";
    string element_type == "complex";
transaction:
    %{ state == 2, gathered == p }%
    int get_grid_n_m(out mat2 out_grid_re, out mat2 out_grid_im,
                     out int m, out int n*p);
protocol: dataflow;



Fig. 6b. Requests Interface of Gather_transpose Component 



7 Case Study - A Generalized Fast Multipole Solver 

The Fast Multipole Method (FMM) [20,21], which solves the N-body electrostatics
problem in O(N) rather than O(N^2) operations, is central to fast computational
strategies for particle simulations. The FMM is also useful for iterative solution of 
linear algebraic equations associated with approximate solution of integral equations. 
There the FMM is used for O(N) matrix-vector multiplication. In order to adapt the 
FMM for applications in fluid and solid mechanics, the classical electrostatics 
problem must be replaced with a generalized electrostatics problem [17,18]. Such 
problems involve vector and tensor valued charges, which means that one generalized 
electrostatics problem is equivalent to several classical electrostatics problems, which 
share the same geometry. In particular, the FLEMS code [17] relies on a generalized
electrostatics problem that is equivalent to 13 classical electrostatics problems.

We have performed a domain analysis for the FMM for generalized (multiple 
charge type) electrostatics. For example, the FMM tree has certain attributes, such as 
its depth and its number of charges per cell and the application component has an 
attribute with values that select between classical and generalized electrostatics. For 
generalized electrostatics the number of charge types is an attribute. For each 
attribute, the analysis defines a range of legal values. Components for a family of
FMM codes for generalized electrostatics were derived from the FLEMS FMM 
implementation. These components were given associative interfaces that define their 
properties and behaviors and were annotated with domain attributes and architectural 
attributes. An instance of the component family can be specified by providing 
specific values for each attribute. An example of an attribute that would lead to 
different implementations is the number of charge types to be processed 
simultaneously. 

There is a family of space-computation tradeoffs in the
matrix-structured formulation [30] of the FMM algorithm which can be chosen to
optimize the code for a given execution environment and problem specification.
These include: 

• Simultaneous computation of cell potentials for multiple charge types. 

• Use of optimized library routines for vector-matrix multiply. 

• Use of optimized library routines for matrix-matrix multiply. 

• Loop interchange over the two outer loops to improve locality (within a
component).

• Number of terms in the multipole expansion. 

There are many variants of these structures and interactions among them. The 
original FMM implementation in the FLEMS code is approximately 4500 lines in 
length with the logic distributed throughout the code. Manual construction of 
optimized versions for even a modest number of execution environments would lead 
to rather complex code. But a small number (eight) of components characterized by 
the number of charges which are simultaneously computed and the number of terms 
in the multipole expansion suffice to realize an important subset of execution 
environment optimized codes. 

The FMM includes five translation theorems: 

• Particle charge to Multipole (P2M is applied at the finest partitioning level) 

• Multipole to Multipole (M2M is applied at all partitioning levels, from the finest 
to the coarsest) 

• Multipole to Local (M2L is applied at all partitioning levels) 

• Local to Local (L2L is applied at all partitioning levels, from the coarsest to the 
finest) 

• Local to Particle potential and forces (L2P is applied at the finest partitioning 
level) 

Two kinds of components are needed to structure the FMM computation framework.
The first category comes directly from the FMM algorithm. The five translation
theorems (charges-to-multipole, multipole-to-multipole, multipole-to-local,
local-to-local, and local-to-potential and force) and the direct-interaction
calculation belong to this category. The second category contains the
communication components, distribute and collect, which also derive from the FMM
algorithm since they implement distribution and collection according to the
interaction lists for each partition of the domain.
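
The order in which the five translation components are applied can be sketched as a simple schedule. The sketch below is only an illustration: it assumes that partitioning levels are numbered from 0 (coarsest) to depth (finest), and the function name fmm_schedule and the (operator, level) encoding are hypothetical, not taken from the FLEMS code. The direct-interaction calculation and the distribute and collect components are omitted.

```python
def fmm_schedule(depth):
    """Return the (operator, level) steps of one FMM evaluation.

    Assumption (ours): level 0 is the coarsest partitioning level and
    `depth` is the finest. For M2M the level is that of the child
    expansions being shifted; for L2L it is that of the parent expansion.
    """
    steps = [("P2M", depth)]                               # finest level only
    steps += [("M2M", lvl) for lvl in range(depth, 0, -1)]  # finest -> coarsest
    steps += [("M2L", lvl) for lvl in range(0, depth + 1)]  # all levels
    steps += [("L2L", lvl) for lvl in range(0, depth)]      # coarsest -> finest
    steps += [("L2P", depth)]                              # finest level only
    return steps


for operator, level in fmm_schedule(2):
    print(operator, level)
```

Each entry of the schedule corresponds to one component invocation, so a family member specialized for a given number of charge types or expansion terms could be substituted at any step without changing the order.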

The data flow graph for the FMM code for two processors is shown in Fig. 7. 







Fig. 7. Data flow Graph of FMM code 



An extensive set of performance studies was made comparing the original and
componentized sequential codes. Preliminary results are reported in [16] and a more
detailed paper is in preparation. The sequential componentized code, contrary to
conventional wisdom, is up to 15 times faster than the original implementation,
which had itself been optimized by several generations of students and
post-doctoral fellows. This surprising result is largely due to specialization of
functionality based on selection of optimal components and to replacing loop
implementations of matrix-matrix multiply with BLAS implementations. Table 1 shows
a small sample of the performance data obtained. The data was taken on a Linux
cluster of Pentium III's at 1.8 Gigahertz with a 100MB Ethernet interconnect.
There are approximately half a million charges in this system. There are two
factors to be noted: (i) speedup is near-linear for the small number of processors
and (ii) the run time increases less than linearly with the number of charge types,
owing to optimizations local to components.



Table 1. Performance data for tree depth of four

Number of       Run time on     Run time on     Run time on
Charge Types    2 processors    4 processors    8 processors
                (Seconds)       (Seconds)       (Seconds)

5               413.84          215.52          121.11
12              561.53          305.50          254.14
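
The two observations about Table 1 can be checked with a few lines of arithmetic. The sketch below copies the run times from the table; the helper name relative_speedup and the choice of the two-processor run as the baseline are ours.

```python
# Run times in seconds, copied from Table 1 (tree depth of four).
run_times = {
    5:  {2: 413.84, 4: 215.52, 8: 121.11},
    12: {2: 561.53, 4: 305.50, 8: 254.14},
}


def relative_speedup(times, base=2):
    """Speedup of each processor count relative to the `base` run."""
    return {procs: times[base] / t for procs, t in times.items()}


for charge_types, times in run_times.items():
    speedups = relative_speedup(times)
    print(charge_types, {p: round(s, 2) for p, s in speedups.items()})

# Ratio of run times for 12 versus 5 charge types on 2 processors,
# to be compared with the ratio of charge types (12 / 5 = 2.4).
time_ratio = run_times[12][2] / run_times[5][2]
print(round(time_ratio, 2))
```

Relative to two processors, the speedups are about 1.92 and 3.42 for five charge types and about 1.84 and 2.21 for twelve. The run time grows by a factor of about 1.36 when the number of charge types grows by a factor of 2.4.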



8 Related Research 

There has been relatively little research on component-based programming in the
context of parallel and distributed programs. Darwin [25] is a composition and
configuration language for parallel and distributed programs. Darwin uses a
configuration script to compose programs from components. This composition
process is effectively manual. In our approach, the composition information is
encapsulated in the components themselves; as a result, the compiler can choose
the required component automatically.

The component-based software development environment [23,28] of the SciRun
project features powerful graphical composition of data flow graphs of components,
which are compiled to parallel programs. H2O [31] is a component-oriented
framework for composition of distributed programs based on web services. Triana
[33] is a graphical development environment for composing distributed programs
from components targeting peer-to-peer execution environments. G2 [24]
composes distributed parallel programs from web services through Microsoft .NET.
Armada [27] composes distributed/parallel programs specialized to data movement
and filtering.

The Common Component Architecture (CCA) project [6] is a major research and 
development project focused on composition of parallel programs from components. 
One primary goal of CCA is to enable composition of programs from components 
written in multiple languages. CCA has developed interface standards. The 
implementations of the CCA interface specifications are object-oriented. There are
several tools, XCAT [19], Ccaffeine [14], and BABEL [7,9], implementing the CCA
interface specification system. Component composition is either graphical or
through scripts and makefiles. CCA components interact through two types of ports.
The first type of port is the provides port, an interface that a component
provides to other components. The second type is the uses port, an interface
through which a component connects with the other components that it requires.
These port types exhibit some similarities to the accepts and requests
transaction specifications. However, the details and implementations are quite
different, as we have focused on incorporating the information necessary to enable
composition by compilation.

ArchJava [4] annotates ports with provides and requires methods, which helps the
programmer to better understand the dependency relations among components by
making them explicit. The accepts and requests interfaces of a P-COM 2
component incorporate signatures, as do ArchJava provides and requires. The accepts
and requests interfaces also include profiles and precedence specifications carrying
semantic information and enabling automatic program composition. The attribute
name/value pairs in profiles are used for both selecting and matching components,
thereby providing semantics-based matching in addition to type checking of the
matching interfaces.
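
The idea of selecting components by attribute name/value pairs can be illustrated with a small sketch. Everything in it is hypothetical: the matching rule (every attribute in a request must have an equal value in the accepting profile), the attribute names, and the component names are invented for illustration and are not the matching algorithm of the compiler described in this paper.

```python
def profile_matches(request, accept):
    """True if every attribute named in the request has the same value
    in the accept profile (hypothetical matching rule)."""
    return all(accept.get(name) == value for name, value in request.items())


def select_components(request, components):
    """Return the names of components whose accepts profile satisfies the request."""
    return [name for name, accept in components.items()
            if profile_matches(request, accept)]


# Hypothetical component family with illustrative attribute names.
components = {
    "m2l_single": {"domain": "fmm", "operation": "M2L", "charge_types": 1},
    "m2l_multi":  {"domain": "fmm", "operation": "M2L", "charge_types": 12},
    "l2l_multi":  {"domain": "fmm", "operation": "L2L", "charge_types": 12},
}
request = {"operation": "M2L", "charge_types": 12}
print(select_components(request, components))
```

A signature check alone could not distinguish the two M2L components here; the attribute values carry the extra semantic information that narrows the choice to one.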

The use of associative interfaces has been reported earlier in the literature.
Associative interfaces are used in a broadcast-based coordination model [12]. That
model uses run-time composition, whereas this paper presents compile-time
composition. Associative interfaces have also been reported in composition of
performance modeling [11].

9 Conclusion and Future Research

This paper has presented a programming model, a programming system and a
compiler for composing distributed and parallel programs from independently written
components. The conceptual foundations are domain analysis, support for families of
programs, integration and automation of discovery and linking, and management of
components with state.

The component-based development method described and illustrated in this paper 
is not intended for development of small or "one-off" applications. The investment of 
effort in domain model development and characterization and encapsulation of 
components is not trivial and these software engineering methods are not typically a 
part of the development process for high performance applications. The target 
applications are those where several instances of an application are to be developed, 
where the application may need to be optimized for several different execution 
environments or where the application is expected to evolve over a substantial period 
of time. In such cases the investment of effort in domain model development and 
characterization and encapsulation of components can be expected to show return. 
That being said, the parallel programs which have been developed to demonstrate and 
evaluate the method show good performance and are readily evolvable. 

We are currently investigating the feasibility of combining runtime [12] and 
compile-time composition of associative interfaces. We plan to implement a hybrid 
graphical composition and compiler-based composition system. We also plan to 
integrate the compositional compiler with the Broadway annotational compiler [22] to 
overcome the problem of "too many components." Finally, we are working on
additional applications including an hp-adaptive finite element code.

Abstract. Development of applications that process large scientific
datasets is often complicated by complex and specialized data storage 
formats. In this paper, we describe the use of XML technologies for sup- 
porting high-level programming methodologies for processing scientific 
datasets. We show how XML Schemas can be used to give a high-level 
abstraction of a dataset to an application developer. A corresponding 
low-level Schema describes the actual layout of data and is used by the 
compiler for code generation. The compiler needs a systematic way for 
translating the high-level code to a low-level code. Then, it needs to 
transform the generated low-level code to achieve high locality and ef- 
ficient execution. This paper describes our approach to these two prob- 
lems. By using Active Data Repository as the underlying runtime system, 
we offer an XML based front-end for storing, retrieving, and processing 
flat-file based scientific datasets in a cluster environment. 



1 Introduction 

Processing and analyzing large volumes of data is playing an increasingly im- 
portant role in many domains of scientific research. Large datasets are being 
created by scientific simulations, or arise from digitization of images and/or 
from data collected by sensors and other instruments. A variety of analyses can
be performed on such datasets to better understand scientific processes.

Development of applications that process large scientific datasets is often 
complicated by complex and specialized data storage formats. When the datasets
are disk-resident, understanding the layout and maintaining high locality in
accessing them is crucial for obtaining reasonable performance. While the
traditional relational database technology supports high-level abstractions and 
standard interfaces, it is suitable more for storing and retrieving datasets, and 
not for complex analyses on such datasets [12]. 

Recently, there has been a lot of interest in XML and other related technolo- 
gies developed by the W3C consortium [5]. XML is a flexible exchange format 
that can represent many classes of data, including structured documents, hetero- 
geneous and semi-structured records, data from scientific experiments and simu- 
lations, and digitized images. One of the key features of XML is XML Schemas, 
which serve as a standard basis for describing the contents and structure of
a dataset. 

In this paper, we describe the use of XML technologies for supporting high- 
level programming methodologies for processing scientific datasets. We partic- 
ularly show how XML Schemas can be used to give a high-level abstraction of 
a dataset to the application developers, who can use such a high-level Schema 
for developing the applications. A corresponding low-level Schema describes the 
actual layout of data, but is hidden from the programmers. The compiler can 
use the source code, the low-level Schema, and the mapping from the high-level 
Schema to the low-level Schema for code generation. 

Two key compiler techniques are required for supporting such an approach. 
First, we need a systematic way to translate the high-level code to the low-level 
code. Second, we need to transform the generated low-level code to achieve high 
locality and efficient execution. This paper describes our approach to these two 
problems. Our techniques have been implemented in a compilation system. By 
using Active Data Repository [6, 7] as the underlying runtime system, we offer 
an XML based front-end for storing, retrieving, and processing flat-file based 
scientific datasets in a cluster environment. 

As part of our system, we use the XML query language XQuery [4] for writ- 
ing queries using high-level abstractions. XQuery is derived from declarative, 
database, as well as functional languages. Though XQuery significantly simpli- 
fies the specification of processing, compiling it to achieve efficient execution 
involves a number of new challenges. Our recent related paper has addressed
two key issues, i.e., replacing recursive reductions by iterative constructs and
type-inferencing to translate from XQuery to an imperative language [11].

2 Background: XML, XML Schemas, and XQuery 

This section gives background on XML, XML Schemas, and XQuery. 



2.1 XML and XML Schemas 

XML provides a simple and general facility which is useful for data interchange.
Though the initial development of XML was mostly for representing structured 
and semi-structured data on the web, XML is rapidly emerging as a general 
medium for exchanging information between organizations. For example, a hos- 
pital generating medical data may make it available to other health organizations 
using XML format. Similarly, researchers generating large data-sets from scien- 
tific simulations may make them available in XML format to other researchers 
needing them for further experiments. 

XML models data as a tree of elements. Arbitrary depth and width is allowed 
in such a tree, which facilitates storage of deeply nested data structures, as well 
as large collections of records or structures. Each element contains character 
data and can have attributes composed of name-value pairs. An XML document 

<student>
  <firstname>Darin</firstname>
  <lastname>Sundstrom</lastname>
  <DOB>1974-01-06</DOB>
  <GPA>3.73</GPA>
</student>

(a) XML example

Schema Declaration

<xs:element name="student">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="lastname" type="xs:string"/>
      <xs:element name="firstname" type="xs:string"/>
      <xs:element name="DOB" type="xs:date"/>
      <xs:element name="GPA" type="xs:float"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

(b) XML Schema

Fig. 1. XML and XML Schema



represents elements, attributes, character data, and the relationship between 
them by simply using angle brackets. 

Note that XML does not specify the actual lay-out of large data on the disks. 
Rather, if a system supports a certain data-set in an XML representation, it must 
allow any application expecting XML data to properly access this data-set. 

Applications that operate on XML data often need guarantees on the structure
and content of data. XML Schema proposals [2, 3] give facilities for describing
the structure and constraining the contents of XML documents. The example in
Figure 1(a) shows an XML document containing records of students. The XML Schema
describing the XML document is shown in Figure 1(b). For each student tuple in
the XML file, it contains two string elements to specify the last and first
names, one date element to specify the date of birth, and one element of float
type for the student’s GPA.
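
To make the role of the element types concrete, the following sketch reads the student record of Figure 1 with Python's standard xml.etree.ElementTree module and converts each field according to the type named in the schema. It is not a schema validator: the FIELD_TYPES table is a hand-written stand-in for the schema, and the function name parse_student is ours.

```python
import xml.etree.ElementTree as ET
from datetime import date

STUDENT_XML = """
<student>
  <firstname>Darin</firstname>
  <lastname>Sundstrom</lastname>
  <DOB>1974-01-06</DOB>
  <GPA>3.73</GPA>
</student>
"""

# Element name -> Python converter standing in for the schema type
# (xs:string -> str, xs:date -> date.fromisoformat, xs:float -> float).
FIELD_TYPES = {
    "firstname": str,
    "lastname": str,
    "DOB": date.fromisoformat,
    "GPA": float,
}


def parse_student(xml_text):
    """Convert one <student> element into a typed dictionary."""
    root = ET.fromstring(xml_text)
    return {name: convert(root.findtext(name).strip())
            for name, convert in FIELD_TYPES.items()}


student = parse_student(STUDENT_XML)
print(student)
```

An application that knows the schema can rely on DOB being a date and GPA being a number without inspecting the document text itself, which is the kind of guarantee the schema is meant to provide.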

2.2 XML Query Language: XQuery 

As stated previously, XQuery is a language currently being developed by the 
World Wide Web Consortium (W3C). It is designed to be a language in which 
queries are concise and easily understood, and to be flexible enough to query 

for $d in document("depts.xml")//deptno
let $e := document("emps.xml")//emp[deptno = $d]
where count($e) >= 10
return
  <big-dept>
    {
      $d,
      <headcount> { count($e) } </headcount>,
      <avgsal> { avg($e/salary) } </avgsal>
    }
  </big-dept>

Fig. 2. An Example Illustrating XQuery’s FLWR Expressions



a broad spectrum of information sources, including both databases and docu- 
ments. 

XQuery is a functional language. The basic building block is an expression. 
Several types of expressions are possible. The two types of expressions important 
for our discussion are: 

— FLWR expressions, which support iteration and binding of variables to inter- 
mediate results. FLWR stands for the keywords for, let, where, and return. 

— Unordered expressions, which use the keyword unordered. The unordered 
expression takes any sequence of items as its argument, and returns the 
same sequence of items in a nondeterministic order. 

We illustrate the XQuery language and the for, let, where, and return expres- 
sions by an example, shown in Figure 2. In this example, two XML documents, 
depts.xml and emps.xml are processed to create a new document, which lists all 
departments with ten or more employees, and also lists the average salary of 
employees in each such department. 

In XQuery, a for clause contains one or more variables, each with an asso- 
ciated expression. The simplest form of for expression, such as the one used in 
the example here, contains only one variable and an associated expression. The 
evaluation of the expression typically results in a sequence. The for clause results 
in a loop being executed, in which the variable is bound to each item from the 
resulting sequence in turn. In our example, the sequence of distinct department 
numbers is created from the document depts.xml, and the loop iterates over each 
distinct department number. 

A let clause also contains one or more variables, each with an associated 
expression. However, each variable is bound to the result of the associated ex- 
pression, without iteration. In our example, the let expression results in the 
variable $e being bound to the set or sequence of employees that belong to the 
department $d. The subsequent operations on $e apply to such sequence. For 
example, count($e) determines the length of this sequence. 

unordered(
  for $d in document("depts.xml")//deptno
  let $e := document("emps.xml")//emp[deptno = $d]
  where count($e) >= 10
  return
    <big-dept>
      {
        $d,
        <headcount> { count($e) } </headcount>,
        <avgsal> { avg($e/salary) } </avgsal>
      }
    </big-dept>
)

Fig. 3. An Example Using XQuery’s Unordered Expression



A where clause serves as a filter for the tuples of variable bindings generated
by the for and let clauses. The expression is evaluated once for each of these
tuples. If the resulting value is true, the tuple is retained; otherwise, it is
discarded. A return clause is used to create an XML record after processing one
iteration of the for loop. The details of the syntax are not important for our
presentation.
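
The roles of the four clauses can be mirrored in an imperative sketch. The Python below follows the structure of the query in Figure 2, using in-memory lists as stand-ins for depts.xml and emps.xml; the data values and the function name big_depts are invented for illustration.

```python
from statistics import mean

# Illustrative stand-ins for depts.xml and emps.xml (invented data).
deptnos = [10, 20]
emps = ([{"deptno": 10, "salary": 100.0 + i} for i in range(10)] +
        [{"deptno": 20, "salary": 50.0} for _ in range(3)])


def big_depts(deptnos, emps, threshold=10):
    result = []
    for d in deptnos:                                    # for $d in ...
        e = [emp for emp in emps if emp["deptno"] == d]  # let $e := ...
        if len(e) >= threshold:                          # where count($e) >= 10
            result.append({                              # return <big-dept> ...
                "deptno": d,
                "headcount": len(e),
                "avgsal": mean(emp["salary"] for emp in e),
            })
    return result


print(big_depts(deptnos, emps))
```

The for clause becomes a loop, the let clause a single binding per iteration, the where clause a filter on that binding, and the return clause the construction of one output record.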

To illustrate the use of unordered, a modification of the example in Figure 2 is
presented in Figure 3. By enclosing the for loop inside the unordered expression, 
we are not enforcing any order on the execution of the iterations in the for loop, 
and in generation of the results. Without the use of unordered, the departments 
need to be processed in the order in which they occur in the document depts.xml. 
However, when unordered is used, the system is allowed to choose the order in 
which they are processed, or even process the query in parallel. 
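
The freedom granted by unordered can be illustrated as follows. In this sketch, which uses Python's standard concurrent.futures module, the ordered version must return results in document order, while the unordered version may run iterations concurrently and return them in any order. The function names and the sample data are ours.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_department(deptno, emps):
    """Body of one loop iteration: head count for a single department."""
    return deptno, sum(1 for e in emps if e["deptno"] == deptno)


def ordered_query(deptnos, emps):
    # Without unordered: results follow the order of the input sequence.
    return [process_department(d, emps) for d in deptnos]


def unordered_query(deptnos, emps, workers=4):
    # With unordered: iterations may run concurrently and results are
    # collected in whatever order they finish.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_department, d, emps) for d in deptnos]
        return [f.result() for f in as_completed(futures)]


deptnos = [30, 10, 20]
emps = [{"deptno": d} for d in (10, 10, 20, 30, 30, 30)]
print(ordered_query(deptnos, emps))
print(sorted(unordered_query(deptnos, emps)))
```

Both versions produce the same collection of results; only the guarantee about their order differs, which is what gives a compiler room to reorder or parallelize the loop.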

3 System Overview 

In this section, we briefly introduce the overall architecture of our system. This 
discussion forms the basis for our description of the various compilation phases. 
An overview of the system is shown in Figure 4. 

Our target environment is a cluster of machines, each with an attached disk.
To efficiently support processing on large disk-resident datasets and on a cluster
architecture, our compiler generates code for a runtime system called Active Data
Repository (ADR) [6, 7]. ADR run-time support has been developed as a set of
modular services implemented in C++, which targets processing of datasets that
are stored as flat files. Our system does not directly process XML datasets. As
a physical lay-out standard, XML involves several-fold storage overheads.
Therefore, for scientific applications that involve large datasets, XML is only
beneficial as a logical lay-out standard. Here, the key advantage of XML
technologies is that XML Schemas allow the users to view the data at a high level.
Consequently, an XML query language like XQuery can be used for specifying the
processing at a high level, i.e., keeping it independent of the details of the
low-level layout of data.

Fig. 4. Overview of the System Architecture

In our system, XML files are mapped to flat files by the XML mapping service
according to an XML Schema. This XML Schema is called the high-level XML
Schema, because it describes a high-level abstraction of the dataset and does not
expose any details of the physical layout of the dataset. The flat file generated by
the XML mapping service is then distributed to the disks of a cluster by
the data distribution service provided by ADR. A low-level XML Schema file
reflecting the physical layout and meta-data information is also provided.
High-level XML Schemas are known to the programmers when developing XQuery
code, and are used by the compiler for XQuery type checking. Low-level
XML Schemas guide the compiler in generating efficient code that executes on
the disk-resident datasets. More details and examples of high-level and low-level
Schemas are given in the next section.



4 High-Level and Low-Level Schemas and XQuery 
Representation 

This section focuses on the interface for the system. We use two motivating ex- 
amples, satellite data processing [7] and the multi-grid virtual microscope [1], for 
describing the notion of high-level and low-level schemas and XQuery represen- 
tation of the processing. 

<xs:element name="pixel" maxOccurs="unbounded">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="x" type="xs:integer"/>
      <xs:element name="y" type="xs:integer"/>
      <xs:element name="date" type="xs:date"/>
      <xs:element name="band0" type="xs:short"/>
      <xs:element name="band1" type="xs:short"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

Fig. 5. High-Level XML Schema for Satellite



4.1 Satellite Data Processing 

The first application we focus on involves processing the data collected from 
satellites and creating composite images. A satellite orbiting the Earth collects 
data as a sequence of blocks. The satellite contains sensors for five different bands. 
The measurements produced by the satellite are short values (16 bits) for each 
band. 

The XML Schema shown in Figure 5 provides a high-level abstraction of the 
satellite data. The pixels captured by the satellite can be viewed as a sparse three 
dimensional array, where time, latitude, and longitude are the three dimensions. 
Pixels for several, but not all, time values are available for any given latitude and 
longitude. Each pixel has 5 short integers to specify the sensor data. Also,
latitude, longitude, and time are stored within each pixel. With this high-level
XML Schema, a programmer can easily define computations processing the satellite
data using XQuery.

The typical computation on this satellite data is as follows. A portion of 
Earth is specified through latitudes and longitudes of end points. A time range 
(typically 10 days to one year) is also specified. For any point on the Earth within 
the specified area, all available pixels within that time period are scanned and an 
application dependent output value is computed. To produce such a value, the 
application will perform computation on the input bands to produce one output 
value for each input value, and then the multiple output values for the same 
point on the planet are combined by a reduction operation. For instance, the
Normalized Difference Vegetation Index (NDVI) is computed based on bands
one and two, and correlates to the “greenness” of the position at the surface of
the Earth. Combining multiple NDVI values consists of executing a max operation
over all of them, that is, finding the “greenest” value for that particular position.
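
The per-pixel computation and the max reduction can be sketched in a few lines. The formula below is the scaled expression that appears in the query of Figure 6; the function names, the band values, and the choice of returning 0 for a position with no pixels (as the query does) are stated here as assumptions of the sketch. It also assumes the two band values do not sum to zero.

```python
def ndvi_value(band0, band1):
    """Scaled index as written in the query of Fig. 6:
    ((band1 - band0) / (band1 + band0) + 1) * 512."""
    return ((band1 - band0) / (band1 + band0) + 1) * 512


def greenest(pixels):
    """Max-reduce the per-pixel values for one position; 0 if no pixels."""
    return max((ndvi_value(p["band0"], p["band1"]) for p in pixels), default=0)


# Invented measurements for a single (x, y) position at three dates.
pixels = [
    {"band0": 100, "band1": 300},
    {"band0": 200, "band1": 200},
    {"band0": 300, "band1": 100},
]
print(greenest(pixels))
```

Because max is associative and commutative, the pixels for one position can be combined in any order, which matters for the compiler transformations discussed later.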

XQuery specification of such processing is shown in Figure 6. The code iter- 
ates over the two-dimensional space for which the output is desired. Since the 
order in which the points are processed is not important, we use the directive 

unordered(
  for $i in ($minx to $maxx)
  for $j in ($miny to $maxy)
  let $p := document("satellite.xml")/data/pixel
  where (($p/x = $i) and ($p/y = $j))
  return
    <pixel>
      <latitute> {$i} </latitute>
      <longtitute> {$j} </longtitute>
      <summary> { accumulate($p) } </summary>
    </pixel>
)

define function accumulate ($p)
as double
{
  let $inp := item-at($p, 1)
  let $NVDI := (($inp/band1 - $inp/band0) div
                ($inp/band1 + $inp/band0) + 1) * 512
  return
    if (empty($p))
    then 0
    else { max($NVDI, accumulate(subsequence($p, 2))) }
}

Fig. 6. Satellite Data Processing Expressed in XQuery



unordered. Within an iteration of the nested for loop, the let statement is used
to create a sequence of all pixels that correspond to those spatial
coordinates. The desired result involves finding the pixel with the best NDVI
value. In XQuery, such a reduction can only be computed recursively.

4.2 Multi-grid Virtual Microscope 

The Virtual Microscope [8] is an application to support the need to interactively 
view and process digitized data arising from tissue specimens. The raw data for 
such a system is captured by digitally scanning collections of full microscope 
slides at high power. In a typical dataset available when a virtual microscope 
is used in a distributed setting, the same portion of a slide may be available at 
different resolution levels, but the entire slide is not available at all resolution 
levels. 

A particular user is interested in viewing a rectangular section of the image 
at a specified resolution level. In computing each component of this rectangular 
section (output), it is first examined if that portion is already available at the 
specified resolution. If it is not available, then we next examine if it is available 
at a higher resolution (i.e., at a smaller granularity). If so, the output portion

<xs:element name="pixel" maxOccurs="unbounded">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="x" type="xs:integer"/>
      <xs:element name="y" type="xs:integer"/>
      <xs:element name="scale" type="xs:short"/>
      <xs:element name="color1" type="xs:short"/>
      <xs:element name="color2" type="xs:short"/>
      <xs:element name="color3" type="xs:short"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

Fig. 7. High-Level XML Schema for Virtual Microscope



is computed by averaging the pixels of the image at the next higher level of 
granularity. If it is only available at a lower resolution, then the pixels from the 
lower resolution image are used to create the output. 

The digitized microscope slides can also be viewed as a three dimensional 
dataset. Each pixel has x and y coordinates and the resolution is the third 
dimension. The high-level XML Schema of virtual microscope is shown in Fig- 
ure 7. For each pixel in a slide, three short integers are used to represent the 
RGB colors. 

XQuery code for performing the computations is shown in Figure 8. We 
assume that the user is only interested in viewing the image at the highest 
possible resolution level, which means that averaging is never done to produce 
the output image. The structure of this code is quite similar to our previous 
example. Inside an unordered for loop, we use the let statement to compute a 
sequence, and then apply a recursive reduction. 

4.3 Low Level XML Schema and XQuery 

The above XQuery codes for multi-grid virtual microscope and satellite data 
processing specify a query on a high-level abstraction of the actual datasets, 
which eases the development of applications. However, storing XML data in 
such a high-level format will result in unnecessary disk space usage as well as 
large overheads on query processing. For example, storing x and y coordinates 
for each pixel in a regular digitized slide of virtual microscope is not necessary, 
since these values can be easily computed from the meta-data and the offset of 
a pixel. 
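
As an illustration of why the coordinates need not be stored, the following sketch recovers (x, y) from a pixel's position in the file. The row-major layout, the width parameter, and the origin (x0, y0) are assumptions made for this example; in the system described here, such information would come from the meta-data.

```python
def pixel_coordinates(offset, width, x0=0, y0=0):
    """Recover (x, y) from a pixel's position in a row-major slide.

    Assumptions (ours): pixels are stored row by row, `width` is the
    number of pixels per row, and (x0, y0) is the origin of the slide.
    """
    row, col = divmod(offset, width)
    return x0 + col, y0 + row


def pixel_offset(x, y, width, x0=0, y0=0):
    """Inverse mapping: position of pixel (x, y) in the flat file."""
    return (y - y0) * width + (x - x0)


print(pixel_coordinates(7, width=4))
print(pixel_offset(3, 1, width=4))
```

With such a mapping, only the color values have to be written to disk, and a query on x and y can be turned into a direct computation of file positions.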

In our system, XML files are mapped to flat files by a special mapping service. 
Pixels in each flat file are later partitioned and organized into chunks by data 
distribution and indexing services. A low-level XML Schema file is provided 
to the compiler after partitioning of the datasets to specify the actual data 
layout. Here, the pixels are divided into chunks. Each chunk is associated with 

unordered(
  for $i in ($x1 to $x2)
  for $j in ($y1 to $y2)
  let $p := document("vmscope.xml")/data/pixel[(x = $i)
            and (y = $j) and (scale > $z1) and (scale < $z2)]
  return
    <pixel>
      <latitute> {$i} </latitute>
      <longtitute> {$j} </longtitute>
      <summary> { accumulate($p) } </summary>
    </pixel>
)

define function accumulate (element pixel $p)
as element
{
  if (empty($p))
  then $null
  else
    let $max := accumulate(subsequence($p, 2))
    let $q := item-at($p, 1)
    return
      if ($q/scale < $max/scale) or ($max = $null)
      then $max
      else $q
}

Fig. 8. Multigrid Virtual Microscope Using XQuery



a bounding box for all pixels it contains, which is specified by a lower bound and 
a higher bound. Within a chunk, the values of pixels are stored consecutively, 
with each pixel occupying three bytes for RGB colors. 
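
A chunk as described above can be modeled with a small data structure. The sketch below is an assumption-laden illustration: the class name Chunk, the two-dimensional inclusive bounding box, and the row-major order of pixels inside a chunk are ours; only the three bytes per pixel come from the text.

```python
from dataclasses import dataclass

BYTES_PER_PIXEL = 3  # one byte each for R, G and B, as described above


@dataclass
class Chunk:
    lower: tuple   # (x, y) lower bound of the bounding box, inclusive
    upper: tuple   # (x, y) upper bound of the bounding box, inclusive
    data: bytes    # pixel values stored consecutively

    def contains(self, x, y):
        return (self.lower[0] <= x <= self.upper[0] and
                self.lower[1] <= y <= self.upper[1])

    def rgb(self, x, y):
        """RGB triple of pixel (x, y), assuming row-major order in the chunk."""
        width = self.upper[0] - self.lower[0] + 1
        index = (y - self.lower[1]) * width + (x - self.lower[0])
        start = index * BYTES_PER_PIXEL
        return tuple(self.data[start:start + BYTES_PER_PIXEL])


def chunks_for(x, y, chunks):
    """Select only the chunks whose bounding box can contain (x, y)."""
    return [c for c in chunks if c.contains(x, y)]


chunk = Chunk(lower=(0, 0), upper=(1, 1), data=bytes(range(12)))
print(chunk.rgb(1, 1))
```

The bounding box lets a query skip whole chunks without reading them, and the fixed pixel size lets the position of a pixel inside a chunk be computed rather than searched for.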

For each application whose XML data is transformed into an ADR dataset by the
data distribution and indexing services, we provide several library functions
written in XQuery to perform data retrieval. These library functions have a common
name, getData, but the function parameters are different. Each getData function
implements a unique selection operation based on its parameters. The getData
functions are similar to physical operators of a SQL query engine. A physical
operator of a SQL engine takes as input one or more data streams and produces
an output data stream. In our case, the default input data stream of a getData
function is the entire dataset, while the output data stream is the result of
filtering the input stream by the parameters of the getData function. For example,
the getData function shown in Figure 9 (a) returns pixels whose x and y coordinates
are equal to those specified by the parameters. The detailed implementation is
based on the meta-data of the dataset, which is specified by the low-level XML

define function getData( $x, $y )
return element 

{ 

} 

(a) 

define function getData( $x ) 
return element 

{ 

} 

(b) 

define function getData( $x, $y, $z ) 
return element 

{ 

} 

( c ) 

Fig. 9. getData functions for Multigrid Virtual Microscope 



Schemas. The getData function in Figure 9 (b) requires only one parameter and
retrieves pixels with the specified x coordinate. For space reasons, the detailed
implementation of only one getData function is shown here.
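
The view of getData as a filtering operator on a data stream can be sketched as a generator. This is not the implementation of the library functions: a single Python function with optional parameters stands in for the family of getData functions in Figure 9, the dataset is a plain list of dictionaries, and treating the third parameter as the scale field is our assumption. An implementation based on the meta-data would avoid the full scan performed here.

```python
def get_data(dataset, x=None, y=None, z=None):
    """Yield the pixels whose coordinates equal the given parameters.

    Parameters left as None are not constrained. Using `z` for the
    scale (resolution) field is our assumption.
    """
    for pixel in dataset:
        if x is not None and pixel["x"] != x:
            continue
        if y is not None and pixel["y"] != y:
            continue
        if z is not None and pixel["scale"] != z:
            continue
        yield pixel


dataset = [
    {"x": 1, "y": 2, "scale": 0},
    {"x": 1, "y": 3, "scale": 0},
    {"x": 2, "y": 2, "scale": 1},
]
print(list(get_data(dataset, x=1, y=2)))   # two-parameter form, as in (a)
print(list(get_data(dataset, x=1)))        # one-parameter form, as in (b)
```

Like a physical operator, each call consumes a stream of pixels and produces a filtered stream, so calls can be placed in front of further processing without materializing the whole dataset.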

The XQuery code for virtual microscope that calls a getData function is 
shown in Figure 10. This query code is called low-level XQuery and is typi- 
cally generated automatically by our compiler. The XQuery codes described in 
the above section operate on high-level data abstractions and are called high- 
level XQuery. The recursive functions used in both the low-level and high-level 
XQuery are the same. 

The low-level XML Schemas and getData functions are expected to be in- 
visible to the programmer writing the processing code. The goal is to provide 
a simplified view of the dataset to the application programmers, thereby easing 
the development of correct data processing applications. The compiler translating
XQuery codes has access to the source code of the getData functions, which
enables it to generate efficient code. However, an experienced
programmer can still have access to getData functions and low-level Schemas.
They can modify the low-level XQuery generated by the compiler, or even write 
their own version of getData functions and low-level XQuery codes. This is the 
major reason why our compiler provides an intermediate low-level query format, 
instead of generating the final executable code directly from high-level codes.

unordered (
  for $i in ($x1 to $x2)
  for $j in ($y1 to $y2)
  let $p := getData($i, $j)
  where (scale > $z1) and (scale < $z2)
  return
    <pixel>
      <latitute> {$i} </latitute>
      <longtitute> {$j} </longtitute>
      <summary> { accumulate($p) } </summary>
    </pixel>
)

Fig. 10. Multigrid Virtual Microscope Using Low Level XQuery 



5 Compiler Analysis 

In this section, we describe the various analyses, transformations, and code generation
issues that are handled by our compiler.

5.1 Overview of the Compilation Problem 

Because the high-level codes shown in Figures 6 and 8 do not reflect any information
about the actual layout of the data, the first task for our compiler is to
generate the corresponding low-level XQuery codes.

After this high-level to low-level query transformation, we can generate correct
code. However, there are still optimization issues that need to be considered.
Consider the low-level XQuery code for the virtual microscope shown in Figure 10.
Suppose we translate this code to an imperative language like C/C++, ignoring
the unordered directive, and preserving the order of the computation otherwise.
It is easy to see that the resulting code will be very inefficient, particularly when
the datasets are large. There are two main reasons. First, each execution
of the let expression will involve a complete scan over the dataset, since
we need to find all data elements that belong to the sequence. Second, if
this sequence involves n elements, then computing the result will require n + 1
recursive function calls, which again is very expensive.

We can significantly simplify the computation if we recognize that the computation
in the recursive function is a reduction operation involving associative and
commutative operators only. This means that instead of creating a sequence and
then applying the recursive function to it, we can initialize the output, process
each element independently, and update the output using the identified associative
and commutative operators. A direct benefit is that we can replace
recursion by iteration, which reduces the overhead of function calls. However,
a more significant advantage is that the iterations of the resulting loop can be
executed in any order. Since such a loop is inside an unordered nested for loop,
powerful restructuring transformations can be applied. In particular, the code
resulting from the data-centric transformation [9, 10] requires only a single
pass over the entire dataset.

Thus, the three key compiler analysis and transformation tasks are: 1) trans- 
forming high-level XQuery codes to efficient low-level query codes, 2) recognizing 
that the recursive function involves a reduction computation with associative 
and commutative operations, and transforming such a recursive function into 
a foreach loop, i.e., a loop whose iterations can be executed in any order, and 
3) restructuring the nested unordered loops to require only a single pass on the 
dataset. 
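
The second task can be illustrated with a small Python sketch. It is not the compiler's output: the function names and the choice of max as the reduction operator are assumptions made for illustration. A linear recursive function over a sequence is replaced by a loop that initializes the output and folds each element into it. Because max is associative and commutative, the loop gives the same result whatever order the elements are visited in.

```python
import random


def accumulate_recursive(seq):
    """Linear recursion: n elements lead to n + 1 calls (including the base case)."""
    if not seq:
        return float("-inf")  # identity element for max
    return max(seq[0], accumulate_recursive(seq[1:]))


def accumulate_foreach(elements):
    """Iterative form: initialize the output, then update it once per element."""
    output = float("-inf")
    for e in elements:  # iterations may run in any order
        output = max(output, e)
    return output


values = [3.0, 9.5, 1.25, 7.0]
shuffled = values[:]
random.shuffle(shuffled)  # visiting order does not affect the result
result = accumulate_foreach(shuffled)
```

The iterative form avoids the chain of function calls and, more importantly for the later transformations, places no constraint on iteration order.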

An algorithm for the second task listed above was presented in our recent 
publication [11]. Therefore, we will only briefly review this issue, and focus on 
the first and the third tasks in the rest of this section. 

5.2 High Level XQuery Transformation 

High-level XQuery provides an easy way to specify operations on high-level abstractions
of a dataset. If the low-level details of the dataset are hidden from a programmer,
a correct application can be developed with ease. However, the performance
of code written in this fashion is likely to be poor, since the programmer
has no idea how the data is stored and indexed.

To address this issue, our compiler needs to translate a program expressed 
in the high-level XQuery to low-level XQuery. As described earlier, a low-level 
XQuery program operates on the descriptions of the dataset specified by the 
low-level XML Schemas. Although the recursive functions defined in both high- 
level and low-level XQuery are almost the same, the low-level XQuery calls one 
or more getData functions defined externally. getData functions specify how to 
retrieve data streams according to meta-data of the dataset. A major task for 
the compiler is to choose a suitable getData function to rewrite the high-level 
query. 

The challenges for this transformation are compatibility and performance of 
the resulting code. This requires the compiler to determine: 1) which of the 
getData functions can be correctly integrated, i.e., if a getData function is 
compatible or not, and 2) which of the compatible functions can achieve the 
best performance. 

We will use the virtual microscope as an example to further describe the problem.
As shown in Figure 8, in each iteration, the high-level XQuery code first retrieves
a desired set of elements from the dataset; a recursive function is then
applied to this data stream to perform the reduction operation. Three
getData functions are provided, each of which retrieves an output data stream from the
entire dataset. The issue is whether and how the output stream from a getData function
can be used to construct the same data stream as used in the high-level query.

For a given getData function $G$ with actual arguments $x_1, x_2, \ldots, x_i$, we
define the output stream of $G(x_1, x_2, \ldots, x_i)$ to be

$$O_G^{(x_1, x_2, \ldots, x_i)}$$

Similarly, for a given query $Q$ with loop indices $I_1, I_2, \ldots, I_j$, we define the
data stream that is processed in a given iteration to be

$$O_Q^{(I_1, I_2, \ldots, I_j)}$$

Let the set of all possible iterations of $Q$ be $I_Q$. We say that a getData
function $G$ is compatible with the query $Q$ if there exists an affine function
$f$ such that

$$\forall (I_1, I_2, \ldots, I_j) \in I_Q, \ \exists\, x_1, x_2, \ldots, x_i$$

such that

$$f(I_1, I_2, \ldots, I_j) = (x_1, x_2, \ldots, x_i)$$

and

$$O_G^{(x_1, x_2, \ldots, x_i)} \supseteq O_Q^{(I_1, I_2, \ldots, I_j)}$$

If a getData function G is compatible with Q , it means that in any iteration 
of the query, we can call this getData function to retrieve a data stream from 
the dataset. Since this data stream is a superset of the desired data stream, we 
can perform another selection on it to get the correct data stream. Here, the 
second selection can be easily performed in memory and without referring to the 
low-level disk layout of the dataset. For the three functions shown in Figure 9, it
is easy to see that the first two functions are compatible. Their selection criteria
are either less restrictive than or as restrictive as those used in the high-level query.
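
The following Python sketch illustrates what compatibility buys us. The in-memory dataset, the Pixel record, and the function names are hypothetical stand-ins, not part of the system described here. A stand-in for getData($x, $y) returns a superset of the stream wanted by the query, and a second selection on scale, done purely in memory, recovers exactly the desired stream.

```python
from typing import NamedTuple


class Pixel(NamedTuple):
    x: int
    y: int
    scale: int
    value: float


# Hypothetical in-memory dataset standing in for the disk-resident file.
DATASET = [
    Pixel(1, 1, 1, 10.0), Pixel(1, 1, 2, 20.0), Pixel(1, 1, 3, 30.0),
    Pixel(1, 2, 2, 40.0), Pixel(2, 1, 2, 50.0),
]


def get_data_xy(x, y):
    """Stand-in for getData($x, $y): selects on x and y only."""
    return [p for p in DATASET if p.x == x and p.y == y]


def query_stream(i, j, z1, z2):
    """Stream wanted by the high-level query in iteration (i, j)."""
    return [p for p in DATASET if p.x == i and p.y == j and z1 < p.scale < z2]


def low_level_stream(i, j, z1, z2):
    """Call getData with f(i, j) = (i, j), then filter the superset in memory."""
    return [p for p in get_data_xy(i, j) if z1 < p.scale < z2]


superset = get_data_xy(1, 1)
wanted = query_stream(1, 1, 1, 3)
```

Here the affine function f is simply the identity on the loop indices; the second selection never touches the (simulated) low-level layout.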

Because of the similarities between the physical operators of an SQL engine and
our getData functions, the technique we propose for translation from high-level
XQuery to low-level XQuery is based on relational algebra. Relational algebra is
an unambiguous notation for expressing queries and manipulating relations and 
is widely used in the database community for query optimization. 

We use the following three-step approach. First, we compute the relational
algebra of the high-level XQuery and getData functions. A typical high-level
XQuery program retrieves desired tuples from an XML file and performs computations
on these tuples. We focus on the data retrieval part. The relational
algebras of XQuery and the getData functions are shown in Figure 11 (a).
Here, we use $\sigma_{(f)}E$ to represent selection from the entire dataset $E$ by applying
restriction $f$.

In the second step, we convert these relational algebras into an equivalent
canonical form that is easier to compare and evaluate. The canonical form we
choose is similar to the disjunctive normal form (DNF), where the relations are
expressed as unions of one or more intersections. Figure 11 (b) shows the equivalent
canonical forms. The canonical forms are internally
represented by trees in our compiler.

In the third step, we compare the canonical forms of the high-level query 
and getData functions. For a given getData function, if its canonical form is

$P_1(\text{Hquery})$ : $\sigma_{(x=*) \wedge (y=*) \wedge (scale > Z_1) \wedge (scale < Z_2)} E$
$P_1(\text{getData}(\$x))$ : $\sigma_{(x=*)} E$
$P_1(\text{getData}(\$x, \$y))$ : $\sigma_{(x=*) \wedge (y=*)} E$
$P_1(\text{getData}(\$x, \$y, \$z))$ : $\sigma_{(x=*) \wedge (y=*) \wedge (scale=*)} E$

(a) Relational algebras for high-level query and getData function

$P_2(\text{Hquery})$ : $(\sigma_{(x=*)} E) \cap (\sigma_{(y=*)} E) \cap (\sigma_{(scale > Z_1)} E) \cap (\sigma_{(scale < Z_2)} E)$
$P_2(\text{getData}(\$x))$ : $\sigma_{(x=*)} E$
$P_2(\text{getData}(\$x, \$y))$ : $\sigma_{(x=*)} E \cap \sigma_{(y=*)} E$
$P_2(\text{getData}(\$x, \$y, \$z))$ : $\sigma_{(x=*)} E \cap \sigma_{(y=*)} E \cap \sigma_{(scale=*)} E$

(b) Relational algebras in canonical form

$P_3(\text{Query})$ : $\sigma_{(scale > Z_1) \wedge (scale < Z_2)} (\sigma_{(x=*) \wedge (y=*)} E)$

(c) Relational algebra for low level query

Fig. 11. Relational Algebra Based Approach for High-level to Low-level Transformation



an isomorphic subtree of the canonical form of the query, we can say that the
getData function is compatible with the original query. This is because when
replacing part of the relational algebra of the high-level query with a getData
function, the query semantics are maintained. From Figure 11 (b) it is easy to
see that the first two getData functions are compatible. getData($x,$y,$z) is
not compatible, because its selection restriction on $z is an equality, while the
restriction of the query on $z is > and <.

The next task is to choose the getData function that will result in the best
performance. The algorithm we currently use is quite simple. Because applying
restrictions early in a selection can reduce the number of tuples to be scanned in
the next operation, a compatible getData function with the most parameters is
preferred. Formally, we select the function whose relational algebra in
canonical form is the largest isomorphic subtree. In our example, the
function we choose is getData($x, $y). The resulting relational algebra for
low-level XQuery is shown in Figure 11 (c). Here, the pixels are retrieved
by calling getData($x, $y) and then performing another selection on the output
stream by applying the restriction on scale.
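
The selection rule can be sketched in Python under a simplifying assumption: each canonical form of Figure 11 (b) is modelled as a flat set of (attribute, comparison) conjuncts, so that the isomorphic-subtree test reduces to a subset test. The compiler itself works on trees; the names below are hypothetical and only illustrate the idea.

```python
# Each canonical form is modelled as a set of (attribute, comparison) conjuncts.
QUERY = frozenset({("x", "="), ("y", "="), ("scale", ">"), ("scale", "<")})

GET_DATA = {
    "getData($x)": frozenset({("x", "=")}),
    "getData($x, $y)": frozenset({("x", "="), ("y", "=")}),
    "getData($x, $y, $z)": frozenset({("x", "="), ("y", "="), ("scale", "=")}),
}


def compatible(candidate, query):
    """A candidate is compatible when all of its conjuncts occur in the query."""
    return candidate <= query


def choose(functions, query):
    """Pick the compatible function with the most conjuncts, or None if none fits."""
    usable = {name: c for name, c in functions.items() if compatible(c, query)}
    if not usable:
        return None, query
    best = max(usable, key=lambda name: len(usable[name]))
    residual = query - usable[best]  # restrictions left for the in-memory selection
    return best, residual


best, residual = choose(GET_DATA, QUERY)
```

In this model the residual set corresponds to the outer selection on scale in Figure 11 (c).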
unordered (
  for $i in ($x1 to $x2)
  for $j in ($y1 to $y2)
    foreach element $e in getData($i, $j)
      if (($e/scale > $z1) and ($e/scale < $z2))
        Insert $e to the sequence $p
    Initialize the output
    foreach element $e in $p
      Apply the reduction function and update output
    return output
)



Fig. 12. Recursion Transformations for Virtual Microscope 



5.3 Reduction Analysis and Transformation 

Now, we have low-level XQuery code, either generated by our compiler or
specified directly by an experienced programmer. Our next task is to analyze the
reduction operation defined in the low-level query. The goal of this analysis is to
generate efficient code that will execute on disk-resident datasets and on parallel
machines.

The reductions on tuples that satisfy user-defined conditions are specified 
through recursive functions. Our analysis requires the recursive function to be
linear recursive, so that it can be transformed into an iterative version. Our
algorithm examines the syntax tree of a recursive function to extract the desired
nodes. These nodes represent associative and commutative operations. The details
of the algorithm are described in a related paper [11]. After extracting
the reduction operation, the recursive function can be transformed into a foreach
loop. An example of this is shown in Figure 12. This foreach loop can be
executed in parallel by initializing the output element on each processor. The 
reduction operation extracted by our algorithm can then be used for combining 
the values of output created on each processor. 
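
A minimal Python sketch of this parallel execution scheme follows. The processors are simulated sequentially, and the round-robin partitioning and function names are assumptions made for illustration rather than the behaviour of the generated code. Each simulated processor initializes its own output and reduces its share of the elements; the same reduction operator then combines the partial outputs.

```python
from functools import reduce


def local_reduce(chunk, op, identity):
    """Work done on one processor: initialize the output, then fold in each element."""
    output = identity
    for e in chunk:
        output = op(output, e)
    return output


def parallel_reduce(elements, nprocs, op, identity):
    """Partition the elements, reduce each part locally, then combine the partials."""
    chunks = [elements[p::nprocs] for p in range(nprocs)]  # round-robin partition
    partials = [local_reduce(c, op, identity) for c in chunks]
    return reduce(op, partials, identity)


data = list(range(1, 11))
total = parallel_reduce(data, 4, lambda a, b: a + b, 0)
```

The combination step is valid only because the operator is associative and commutative, which is exactly what the reduction analysis establishes.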



5.4 Data Centric Transformation 

Replacing the recursive computation by a foreach loop is only an enabling transformation
for our next step. The key transformation that provides a significant
difference in performance is the data-centric transformation, which is described
in this section.

In Figure 12, we show the outline of the virtual microscope code after replac- 
ing recursion by iteration. Within the nested for loops, the let statement and 
the recursive function are replaced by two foreach loops. The first of these loops 
iterates over all elements in the document and creates a sequence. The second 
foreach loop performs the reduction by iterating over this sequence. 






for $i in ($x1 to $x2)
  for $j in ($y1 to $y2)
    Initialize output[i,j]

foreach element $e in //data/chunks/vmpixal
  if (//data/scale > $z1)
     and (//data/scale < $z2)
    $i = //data/chunks/low/x + (offset div (512...))
    $j = //data/chunks/low/y + (offset % (512...))
    if ($i > $x1) and ($i < $x2) and
       ($j > $y1) and ($j < $y2)
      Apply the reduction function and update output[i,j]



Fig. 13. Data-Centric Transformations on Virtual Microscope Code 



The code in Figure 12 is very inefficient because it iterates
over the entire dataset a large number of times. If the dataset is disk-resident,
this can mean extremely high overhead because of disk latencies. Even if the
dataset is memory-resident, this code will have poor locality and, therefore, poor
performance.

Since the input dataset is never modified, it is clearly possible to restructure
such code so that it requires only a single pass over the dataset. However, the challenge
is to perform this transformation automatically. We apply the data-centric
transformation that has previously been used for optimizing locality in scientific
codes [9, 10]. The overall idea is to iterate over the available data elements,
and then find and execute the iterations of the nested loop in which they are accessed.
As part of our compiler, we apply this transformation to the intermediate
code we obtain after removing recursion. The results of performing the data-centric
transformation on the virtual microscope are shown in Figure 13. This code
requires only one scan of the entire dataset.
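
The effect of the transformation can be seen in a small Python sketch. The tuple layout (x, y, value), the use of summation as the reduction, and the inclusive bounds are assumptions made for illustration. The loop-centric version scans the dataset once per (i, j) iteration, whereas the data-centric version scans it once and lets each element update the iteration that uses it.

```python
def loop_centric(dataset, x1, x2, y1, y2):
    """One scan of the dataset per (i, j) iteration."""
    out, scans = {}, 0
    for i in range(x1, x2 + 1):
        for j in range(y1, y2 + 1):
            scans += 1
            out[i, j] = sum(v for (x, y, v) in dataset if x == i and y == j)
    return out, scans


def data_centric(dataset, x1, x2, y1, y2):
    """One scan in total: each element updates the iteration that uses it."""
    out = {(i, j): 0 for i in range(x1, x2 + 1) for j in range(y1, y2 + 1)}
    for (x, y, v) in dataset:
        if x1 <= x <= x2 and y1 <= y <= y2:
            out[x, y] += v
    return out, 1


dataset = [(0, 0, 1), (0, 1, 2), (1, 1, 3), (1, 1, 4), (5, 5, 9)]
a, scans_a = loop_centric(dataset, 0, 1, 0, 1)
b, scans_b = data_centric(dataset, 0, 1, 0, 1)
```

Both versions produce the same output table; only the number of passes over the data differs, and that number dominates the cost for disk-resident datasets.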

6 Experimental Results 

This section reports experimental data from our current compilation system. 
We used the two real applications, satellite and mg-vscope, discussed earlier 
in this paper. The cluster we used had 700 MHz Pentium machines connected 
through Myrinet LANai 7.0. We ran our experiments on 1, 2, 4, and 8 nodes of the
cluster.

The goal of our experiments was to demonstrate that even with high-level
abstractions and a high-level language like XQuery, our compiler is able to generate
reasonably efficient code. The compiler-generated codes for our two applications
were compared against versions whose performance was reported in earlier
work [9]. These versions were generated by a compiler starting from a data-parallel
dialect of Java, and were further manually optimized. For our discussion,
the versions generated by our current compiler are referred to as comp and the
baseline versions are referred to as manual.

Fig. 14. Parallel Performance of satellite

Fig. 15. Parallel Performance for mg-vscope (x-axis: Number of Processors)

For the mg-vscope application, the dataset we used contains an image of
29,238 x 28,800 pixels collected at 5 different magnification levels, which corresponds
to 3.3 GB of data. The query we used processes a region
of 10,000 x 10,000 pixels, which corresponds to reading 627 MB and generating
an output of 400 MB. The entire dataset for the satellite application
contains data for the entire earth at a resolution of 1/128th of a degree in latitude
and longitude, over a period of time that covers nearly 15,000 time steps.
The size of the dataset is 2.7 GB. The query we used traverses a region of
15,000 x 10,000 x 10,000, which involves reading 446 MB to generate an output
of 50 MB.

The results from satellite are presented in Figure 14. The results from
mg-vscope are presented in Figure 15. For both applications and on 1, 2,
4, and 8 nodes, the comp versions are slower. However, the difference in performance
is only between 5% and 8% for satellite and between 18% and 22% for
mg-vscope. The speedup on 8 nodes is around 6 for both versions of satellite
and around 4 for both versions of mg-vscope. The reason for the limited speedups
is the high communication volume.

To understand the differences in performance, we carefully compared the 
comp and manual versions. Our analysis shows that a number of additional simple 
optimizations can be implemented in the compiler to bridge the performance 
difference. These optimizations are function inlining, loop-invariant code motion,
and elimination of unnecessary copying of buffers.

7 Conclusions 

In this paper, we have described a system that offers an XML based front-end 
for storing, retrieving, and processing flat-file based scientific datasets. With the 
use of aggressive compiler transformations, we support high-level abstractions for 
a dataset, and hide the complexities of the low-level layout from the application 
developers. Processing on datasets can be expressed using XQuery, the recently 
developed XML Query language. Our preliminary experimental results from two 
applications have shown that despite using high-level abstractions and a high- 
level language like XQuery, the compiler can generate efficient code. 
Abstract. We describe two applications of our HPJava language for
parallel computing. The first is a multigrid solver for a Poisson equation, 
and the second is a CFD application that solves the Euler equations for 
inviscid flow. We illustrate how the features of the HPJava language allow 
these algorithms to be expressed in a straightforward and convenient way. 
Performance results on an IBM SP3 are presented. 



1 Introduction 

The HPJava project [10] has developed a translator and libraries for a version of
the Java language extended to support parallel and scientific computing. Version 
1.0 of the HPJava software was released earlier this year as open source software. 
This paper reports experiences using HPJava for applications, with some bench- 
mark results. A particular goal here is to argue the case that our programming 
model is flexible and convenient for writing non-trivial scientific applications. 

HPJava extends the standard Java language with support for “scientific”
multidimensional arrays (multiarrays), and support for distributed arrays, familiar
from High Performance Fortran (HPF) and related languages. Considerable
work has been done on adding features like these to Java and C++ through class 
libraries (see for example [17], [8], [15]). This seems like a natural approach in 
an object oriented language, but the approach has some limits: most obviously 
the syntax tends to be inconvenient. Lately there has been widening interest in 
adding extra syntax to Java for multiarrays, often through preprocessors 1 . 

From a parallel computing point of view, an interesting feature of HPJava
is its spartan programming model. Although HPJava introduces special syntax
for HPF-like distributed arrays, the language deliberately minimizes compiler
intervention in manipulating distributed data structures. In HPF and similar
languages, elements of distributed arrays can be accessed on essentially the
same footing as elements of ordinary (sequential) arrays: if the element being
accessed resides on a different processor, some run-time system is probably
invoked transparently to “get” or “put” the remote element. HPJava does not
have this feature. It was designed as a framework for development of explicit libraries
operating on distributed data. In this mindset, the right way of accessing
remote data is to explicitly invoke a communication library method to get or put
the data.

1 See, for example, the minutes of recent meetings at [12].

So HPJava provides some special syntax for accessing locally held elements
of multiarrays and distributed arrays, but stops short of adding special syntax
for accessing non-local elements. Non-local elements can only be accessed
by making explicit library calls. The language attempts to capture the successful
library-based approaches to SPMD parallel computing; it is very much in
the spirit of MPI, with its explicit point-to-point and collective communications.
HPJava raises the level of abstraction a notch, and adds excellent support for
development of libraries that manipulate distributed arrays. But it still exposes
a multi-threaded, non-shared-memory execution model to the programmer. Advantages
of this approach include flexibility for the programmer, and ease of
compilation, because the compiler does not have to analyse and optimize communication
patterns.

The basic features of HPJava have been described in several earlier publica- 
tions. In this paper we will jump straight into a discussion of the implementa- 
tion of some representative applications in HPJava. After briefly reviewing the 
compilation strategy in section 2, we illustrate typical patterns of HPJava pro- 
gramming through a multigrid algorithm in section 3. This section also serves 
to review basic features of the language. Section 4 describes another substantial
HPJava application, a CFD code, and highlights additional common coding
patterns. Section 5 collects together benchmark results from these applications.

1.1 Related Work 

Other ongoing projects that extend the Java language to directly support scien- 
tific parallel computation include Titanium [3] from UC Berkeley, Timber/Spar 
[2] from Delft University of Technology, and Jade [6] from University of Illinois 
at Urbana-Champaign. 

Titanium adds a comprehensive set of parallel extensions to the Java lan- 
guage. For example it includes support for a shared address space, and does 
compile-time analysis of patterns of synchronization. This contrasts with our 
HPJava, which only adds new data types that can be implemented “locally”, and 
leaves all interprocess communication issues to the programmer and libraries. 

The Timber project extends Java with the Spar primitives for scientific pro- 
gramming, which include multidimensional arrays and tuples. It also adds task 
parallel constructs like a foreach construct. 

Jade focuses on message-driven parallelism extracted from interactions be- 
tween a special kind of distributed object called a Chare. It introduces a kind of 
parallel array called a ChareArray. Jade also supports code migration. 

HPJava differs from these projects in emphasizing a lower-level (MPI-like) 
approach to parallelism and communication, and by importing HPF-like distri- 
bution formats for arrays. Another significant difference between HPJava and 
the other systems mentioned above is that HPJava translates to Java byte codes, 
relying on clusters of conventional JVMs for execution. The systems mentioned 
above typically translate to C or C++. While HPJava may pay some price in
performance for this approach, it tends to be more fully compliant with the
standard Java platform (e.g. it allows local use of Java threads, and APIs that 
require Java threads). 



2 Features of the HPJava System 

HPJava adds to Java a concept of multi-dimensional arrays called “multiarrays” 
(consistent with proposals of the Java Grande Forum). To support parallel pro- 
gramming, these multiarrays are extended to “distributed arrays”, very closely 
modelled on the arrays of High Performance Fortran. The new distributed data 
structures are cleanly integrated into the syntax of the language (in a way that 
doesn’t interfere with the existing syntax and semantics of Java — for example 
ordinary Java arrays are left unaffected). 

In the current implementation, the source HPJava program is translated to 
an intermediate standard Java file. The preprocessor that performs this task is 
reasonably sophisticated. For example it performs a complete static semantic 
check of the source program, following rules that include all the static rules of 
the Java Language Specification [9]. So it shouldn’t normally happen that a pro- 
gram accepted by the HPJava preprocessor would be rejected by the backend 
Java compiler. The translation scheme depends on type information, so we were 
essentially forced to do a complete type analysis for HPJava (which is a superset 
of standard Java). Moreover we wanted to produce a practical tool, and we felt 
users would not accept a simpler preprocessor that did not do full checking. 

The current version of the preprocessor also works hard to preserve line- 
numbering in the conversion from HPJava to Java. This means that the line 
numbers in run-time exception messages accurately refer back to the HPJava 
source. Clearly this is very important for easy debugging. 

A translated and compiled HPJava program is a standard Java class file, 
ready for execution on a distributed collection of JIT-enabled Java Virtual Ma- 
chines. All externally visible attributes of an HPJava class — e.g. existence of 
distributed-array-valued fields or method arguments — can be transparently re- 
constructed from Java signatures stored in the class file. This makes it possible 
to build libraries operating on distributed arrays, while maintaining the usual 
portability and compatibility features of Java. The libraries themselves can be 
implemented in HPJava, or in standard Java, or as JNI interfaces to other lan- 
guages. The HPJava language specification documents the mapping between 
distributed arrays and the standard-Java components they translate to. 

Currently HPJava is supplied with one library for parallel computing — a Java 
version of the Adlib library of collective operations on distributed arrays [18]. 
A version of the mpiJava [1] binding of MPI can also be called directly from 
HPJava programs. Of course we would hope to see other libraries made available 
in the future.



3 A Multigrid Application

The multigrid method [5] is a fast algorithm for solution of linear and nonlinear 
problems. It uses a hierarchy or stack of grids of different granularity (typically 
with a geometric progression of grid-spacings, increasing by a factor of two up 
from finest to coarsest grid). Applied to a basic relaxation method, for example, 
multigrid hugely accelerates elimination of the residual by restricting a smoothed 
version of the error term to a coarser grid, computing a correction term on the 
coarse grid, then interpolating this term back to the original fine grid. Because 
computation of the correction term on the fine grid can itself be handled as 
a relaxation problem, the strategy can be applied recursively all the way up the 
stack of grids. 

In our example, we apply the multigrid scheme to solution of the two-dimensional
Poisson equation. For the basic, unaccelerated, solution scheme we
use red-black relaxation. An HPJava method for red-black relaxation is given
in Figure 1. This looks something like an HPF program with different syntax.
One obvious difference is that the base language is Java instead of Fortran. The
HPJava type signature double [[-,-]] means a two-dimensional distributed
array of double numbers (see footnote 2). So the arguments passed to the method relax() will
be distributed arrays.

The inquiry rng() on the distributed array f returns the Range objects 
x, y. These describe the distribution format of the array index (for the two 
dimensions). 

The HPJava overall construct operates like a forall construct, with one
important difference. In the HPJava construct one must specify how the iteration
space of the parallel loop is distributed over processors. This is done by specifying 
a Range object in the header of the construct. 

The variables i, j in the figure are called distributed index symbols. Distributed
indexes are scoped by the overall constructs that use them. They are
not integer variables, and there is no syntax to declare a distributed index except
through an overall construct (or an at construct; see later). The usual Java
scoping rules for local variables apply: one can’t, for example, use i as the index
of an overall if there is already a local variable i in scope; the compiler doesn’t
allow it.

An unusual feature of the HPJava programming model is that the subscripts
in a distributed array element reference usually must be distributed index symbols.
And these symbols must be distributed with essentially the same format
as the arrays they subscript. As a special case, shifted index expressions like
i+1 are allowed as subscripts, but only if the distributed array was created with
ghost regions. Information on ghost regions, along with other information about
distribution format, is captured in the Range object associated with the array
dimension or index.

2 The main technical reason for using double brackets here is that it is useful to support
an idea of rank-zero distributed arrays: these are “distributed scalars”, which have
a localization (a distribution group) but no index space. If we used single brackets for
distributed array type signatures, then double [] could be ambiguously interpreted
as either a rank-zero distributed array or an ordinary Java array of doubles.

static void relax(int itmax, int np,
                  double [[-,-]] u, double [[-,-]] f) {

  Range x = f.rng(0), y = f.rng(1);

  for (int it = 1; it <= itmax * 2; it++) {

    Adlib.writeHalo(u);

    overall (i = x for 1 : np - 2)
      overall (j = y for 1 + (i` + it) % 2 : np - 2 : 2) {

        u [i, j] = 0.25 * (f [i, j] +
                           u [i - 1, j] + u [i + 1, j] +
                           u [i, j - 1] + u [i, j + 1]);
      }
  }
}

Fig. 1. Red black relaxation on array u

These requirements ensure that a subscripting operation in an overall con- 
struct only accesses locally held elements. They place very stringent limitations 
on what kind of expression can appear as a subscript of a distributed array. We 
justify this by noting that this restricted kind of data parallel loop is a frequently 
recurring pattern in SPMD programs in general, and it is convenient to have it 
captured in syntax. A glance at the full source of the applications described in
this paper should make this claim more plausible (see footnote 3).

The method Adlib.writeHalo() is a communication method (from the library
called Adlib). It performs the edge-exchange to fill in the ghost regions. As
emphasized earlier, the compiler is not responsible for inserting communications;
this is the programmer’s responsibility. We assume this should be acceptable to
programmers currently accustomed to using MPI and similar libraries for communication.

³ When less regular patterns of access are necessary, the approach depends on the
locality of access: if accesses are irregular but local one can extract the locally-held
blocks of the distributed array by suitable inquiries, and operate on the blocks as
in an ordinary SPMD program; if the accesses are non-local one must use suitable
library methods for doing irregular remote accesses.

Because of the special role of distributed index symbols in array subscripts,
it is best not to think of the expressions i, j, i+1, etc., as having a numeric
value: instead they are treated as a special kind of thing in the language. We use
the notation i` to extract the numeric global index associated with i, say⁴. In
particular, use of this expression in the modulo 2 expression in the inner overall
construct in Figure 1 implements the red-black pattern of accesses.
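To make the role of the modulo 2 expression concrete, here is a minimal sequential sketch in plain Python (not HPJava). It assumes the grid is held as a list of lists on a single process, so the writeHalo() exchange has no counterpart; the function names are ours and purely illustrative.

```python
def red_black_sweep(u, f, it):
    """Update the interior points of u whose colour matches sweep number `it`."""
    n = len(u)
    for i in range(1, n - 1):                 # rows 1 .. n-2, as in Figure 1
        start = 1 + (i + it) % 2              # first column of this colour in row i
        for j in range(start, n - 1, 2):      # stride 2 skips the other colour
            u[i][j] = 0.25 * (f[i][j] +
                              u[i - 1][j] + u[i + 1][j] +
                              u[i][j - 1] + u[i][j + 1])


def relax(u, f, itmax):
    """Run 2 * itmax sweeps, alternating between the two colours."""
    for it in range(1, 2 * itmax + 1):
        red_black_sweep(u, f, it)
    return u
```

Because every neighbour of an updated point has the other colour, the points touched in one sweep never read each other, which is what allows the sweep to be written as a data parallel loop.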

This completes the description of most “non-obvious” features of HPJava
syntax. Remaining examples in the paper either recycle these basic ideas, or just
introduce new library routines; or they import relatively uncontroversial syntax,
like a syntax for array sections.

Figure 2 visualizes the “restrict” operation that is used to transfer the error
term from a fine grid to a coarse grid. The HPJava code is given in Figure 3. The
restrict operation here computes the residual term at sites of the fine grid with
even coordinate values, then sends these values to the coarse grid. In multigrid
the restricted residual from the fine grid becomes the RHS of a new equation
on the coarse grid. The implementation uses a temporary array tf which should
be aligned with the fine grid (only a subset of elements of this array are
actually used). The last line introduces two new features: distributed array sections,
and the library function Adlib.remap(). Sections work in HPJava in much the
same way as in Fortran; one small syntactic difference is that they use double
brackets. The bounds in the fc section ensure that edge values, corresponding
to boundary conditions, are not modified. The stride in the tf section ensures
only values with even subscripts are selected. The Adlib.remap() operation is
needed because in general there is no simple relation between the distribution
format of the fine and coarse grid: this function introduces the communications
necessary to perform an assignment between any two distributed arrays with
unrelated distribution format. As another example, the interpolation code of
Figure 4 performs the complementary transformation from the coarse grid to
the fine grid.
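The data movement performed by the restrict operation can be sketched sequentially in plain Python. This is an illustration only: it assumes square grids stored as lists of lists with npf = 2 * (npc - 1) + 1, it replaces the Adlib.remap() call by an ordinary copy, and it accumulates into tf with += as the figure does.

```python
def restrict(npc, npf, fc, uf, ff, tf):
    """Sequential analogue of the restrict operation of Figure 3."""
    nc, nf = npc - 1, npf - 1

    # Residual-like term at fine-grid sites with even coordinates 2, 4, ..., nf-2.
    for i in range(2, nf - 1, 2):
        for j in range(2, nf - 1, 2):
            tf[i][j] += 2.0 * (ff[i][j] - 4.0 * uf[i][j] +
                               uf[i - 1][j] + uf[i + 1][j] +
                               uf[i][j - 1] + uf[i][j + 1])

    # Stand-in for remap: fc[[1 : nc-1, 1 : nc-1]] = tf[[2 : nf-2 : 2, 2 : nf-2 : 2]].
    # Edge values of fc (the boundary conditions) are left untouched.
    for ic in range(1, nc):
        for jc in range(1, nc):
            fc[ic][jc] = tf[2 * ic][2 * jc]
```

In the distributed setting the final copy is the only step that may need communication, since the fine and coarse grids can be distributed differently.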

⁴ Early versions of the language used a more conventional “pseudo-function” syntax
rather than the “primed” notation. The current syntax arguably makes expressions
more readable, and emphasizes the unique status of the distributed index in the
language.

static void restr(int npc, int npf,
                  double fc [[-,-]], double uf [[-,-]],
                  double ff [[-,-]], double tf [[-,-]]) {

    Range xf = ff.rng(0), yf = ff.rng(1) ;
    int nc = npc - 1, nf = npf - 1 ;

    Adlib.writeHalo(uf) ;

    overall (i = xf for 2 : nf - 2 : 2)
        overall (j = yf for 2 : nf - 2 : 2)
            tf [i, j] += 2.0 *
                (ff [i, j] - 4.0 * uf [i, j] +
                 uf [i - 1, j] + uf [i + 1, j] +
                 uf [i, j - 1] + uf [i, j + 1]) ;

    Adlib.remap(fc [[1 : nc - 1, 1 : nc - 1]],
                tf [[2 : nf - 2 : 2, 2 : nf - 2 : 2]]) ;
}

Fig. 3. HPJava code for restrict operation



static void interp(int npf, double [[-,-]] uc,
                   double [[-,-]] uf, double [[-,-]] tf) {

    Range xf = uf.rng(0), yf = uf.rng(1) ;

    int nf = npf - 1 ;

    Adlib.remap(tf [[0 : nf : 2, 0 : nf : 2]], uc) ;

    Adlib.writeHalo(tf) ;

    overall (i = xf for 1 : nf - 1 : 2)
        overall (j = yf for 2 : nf - 2 : 2)
            uf [i, j] += 0.5 * (tf [i - 1, j] + tf [i + 1, j]) ;

    overall (i = xf for 2 : nf - 2 : 2)
        overall (j = yf for 1 : nf - 1 : 2)
            uf [i, j] += 0.5 * (tf [i, j - 1] + tf [i, j + 1]) ;
}

Fig. 4. HPJava code for interpolate operation
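For comparison, the two overall nests of Figure 4 can be written as ordinary loops. The following Python sketch is sequential and only illustrative: it assumes lists of lists, a coarse grid uc of side (npf - 1) / 2 + 1, and it replaces Adlib.remap() by a strided copy; writeHalo() has no counterpart on a single process.

```python
def interpolate(npf, uc, uf, tf):
    """Sequential analogue of the interpolate operation of Figure 4."""
    nf = npf - 1

    # Stand-in for remap: tf[[0 : nf : 2, 0 : nf : 2]] = uc.
    for ic in range(len(uc)):
        for jc in range(len(uc)):
            tf[2 * ic][2 * jc] = uc[ic][jc]

    # Odd i, even j: average the coarse values above and below.
    for i in range(1, nf, 2):
        for j in range(2, nf - 1, 2):
            uf[i][j] += 0.5 * (tf[i - 1][j] + tf[i + 1][j])

    # Even i, odd j: average the coarse values to the left and right.
    for i in range(2, nf - 1, 2):
        for j in range(1, nf, 2):
            uf[i][j] += 0.5 * (tf[i][j - 1] + tf[i][j + 1])
```

Only the sites covered by the two loop nests in the figure are updated here; the sketch does not add any further cases.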



The basic pattern here depends only on the geometry of the problem. More
complex (perhaps non-linear) equations with similar geometry could be tackled
by similar code. Problems with more dimensions can also be programmed in
a similar way.



4 A CFD Application 



In this section we discuss another significant HPJava application code. This code
solves the Euler equations for inviscid fluid flow by a finite volume approach.
One version of this code, viewable at http://www.hpjava.org/demo.html, also
has a novel parallel GUI implemented in HPJava⁵.

The Euler equations are a family of conservation equations, relating the time 
rates of change of various densities to divergences of associated flow fields. In 
two dimensions there are four densities — the ordinary matter density, densities of 
the two components of momentum, and the energy density. The Euler equations 
can be summarized as a conservation equation for four-component vectors U, f 
and g: 



$$\frac{\partial U}{\partial t} + \frac{\partial f}{\partial x} + \frac{\partial g}{\partial y} = 0 \qquad (1)$$



The flow variables (f, g) are related to the dependent variables U by simple (but
non-linear) algebraic equations. So the set of differential equations is closed.
Two important quantities that figure in the equations are the pressure, p, and
the enthalpy per unit mass, H, which can be computed from the components
of U using the equations of state for the fluid.



4.1 Discretization and Numerical Integration 



The system of partial differential equations is discretized by a finite volume
approach; see for example [7] or [11]. Space is divided into a series of quadrilateral (but
not necessarily rectangular) cells labelled (i, j). This reduces the PDEs to a large
coupled system of ordinary differential equations. These are integrated by a
variant of the well-known 4th order Runge-Kutta scheme. A single time-step involves
four stages like:



$$U'_{i,j} = U_{i,j} - \alpha \frac{\Delta t}{\Omega_{i,j}} R_{i,j}(U) \qquad (2)$$



where $\alpha$ is a fractional value characteristic of the scheme, and

$$R_{i,j}(U) = \sum_{\text{faces of cell}} \left( f S_y - g S_x \right) \qquad (3)$$

Here $\Omega_{i,j}$ is the volume of a cell and $S_x$, $S_y$ are coordinate differences between
end-points of the face. Since the dependent variables and fluxes are defined at
cell centers, their values at a cell face in equation 3 are approximated as the
average of the values from the two cells meeting at the face.

⁵ The code is adapted from a version of an original Java code by David Oh of MIT [16],
modified by Saleh Elmohamed and Mike McMahon of Syracuse University. It is
almost identical to the CFD benchmark in the Java Grande Benchmark suite, which
came from the same original source.

So at its most basic level the program for integrating the Euler equations 
consists of a series of steps like: 

1. Calculate p, H from current U (via equations of state).

2. Calculate f from U, p, H.

3. Calculate g from U, p, H.

4. Calculate R from f, g.

5. Update U.

To parallelize in HPJava, the discretized field variables are naturally stored in
distributed arrays. All the steps above become overall nests. As a relatively
simple case, the operation to calculate f (step 2) looks like:

Statevector [[-,-]] U, f, ... ;
double [[-,-]] p, H, ... ;

overall (i = x for 0 : imax)
    overall (j = y for 0 : jmax) {
        double u = U [i, j].b / U [i, j].a ;    // velocity component

        f [i, j].a = U [i, j].b ;
        f [i, j].b = U [i, j].b * u + p [i, j] ;
        f [i, j].c = U [i, j].c * u ;
        f [i, j].d = H [i, j] * U [i, j].b ;
    }



The four fields a, b, c, d of Statevector correspond to the four conserved 
densities. A general observation is that the bodies of overall statements are now 
more complex than those in the (perhaps artificially simple) Poisson equation 
example of the previous section. We expect this will often happen in “real” 
applications. It is good for HPJava, because it means that various overheads 
associated with starting up a distributed loop are amortized better. 
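Steps 1 and 2 of the list above can be illustrated for a single cell in plain Python. This is a sketch under stated assumptions, not the application code: we read the fields a, b, c, d as density, x-momentum, y-momentum and energy density, and we assume an ideal gas with ratio of specific heats GAMMA = 1.4, since the paper does not spell out its equations of state. The names StateVector, pressure_and_enthalpy and x_flux are ours.

```python
from collections import namedtuple

# Assumed field order: density, x-momentum, y-momentum, energy density.
StateVector = namedtuple("StateVector", "a b c d")

GAMMA = 1.4  # assumption: ideal-gas ratio of specific heats, not taken from the paper


def pressure_and_enthalpy(U):
    """Step 1: pressure p and enthalpy per unit mass H from the conserved densities."""
    kinetic = 0.5 * (U.b * U.b + U.c * U.c) / U.a
    p = (GAMMA - 1.0) * (U.d - kinetic)
    H = (U.d + p) / U.a
    return p, H


def x_flux(U, p, H):
    """Step 2: the flux f for one cell, field by field as in the overall body."""
    u = U.b / U.a                      # velocity component
    return StateVector(a=U.b,
                       b=U.b * u + p,
                       c=U.c * u,
                       d=H * U.b)
```

Each cell is processed independently of its neighbours, which is why this step needs no communication when the arrays are aligned.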

Another noteworthy thing is that these overall statements work naturally
with aligned data: no communication is needed here. Out of the five stages
enumerated above, only computation of R involves non-local terms (formally
because of the use of averages across adjacent cells for the flow values at the
faces). The code can be written easily using ghost regions, shifted indices, and
the writeHalo() operation. Again it involves a single overall nest with a long
body. A much-elided outline is given in Figure 5. The optional arguments wlo,
whi to Adlib.writeHalo() define the widths of the parts of the ghost regions that
need updating (the default is to update the whole of the ghost regions of the
array, whatever their width). In the current case these vectors both have value
[1, 1] because shifted indices displace one site in positive and negative x and
y directions.

Adlib.writeHalo(f, wlo, whi) ;
Adlib.writeHalo(g, wlo, whi) ;

overall (i = x for 1 : imax - 1)
    overall (j = y for 1 : jmax - 1) {

        ... Set fields of r [i, j] to zero ...

        // East face

        hy = 0.5 * (ynode [i, j] - ynode [i, j - 1]) ;

        r [i, j].a += hy * (f [i, j].a + f [i + 1, j].a) ;
        r [i, j].b += hy * (f [i, j].b + f [i + 1, j].b) ;
        r [i, j].c += hy * (f [i, j].c + f [i + 1, j].c) ;
        r [i, j].d += hy * (f [i, j].d + f [i + 1, j].d) ;

        hx = 0.5 * (xnode [i, j] - xnode [i, j - 1]) ;

        r [i, j].a -= hx * (g [i, j].a + g [i + 1, j].a) ;
        r [i, j].b -= hx * (g [i, j].b + g [i + 1, j].b) ;
        r [i, j].c -= hx * (g [i, j].c + g [i + 1, j].c) ;
        r [i, j].d -= hx * (g [i, j].d + g [i + 1, j].d) ;

        ... Add similar contributions for S, W, N faces ...
    }

Fig. 5. Outline of computation of R



The arrays xnode and ynode hold coordinates of the cell vertices. Because 
these arrays are constant through the computation, the ghost regions of these 
arrays are initialized once during startup. 
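The contribution of a single face to R can be sketched in plain Python as follows. This is an illustration under our own assumptions: state vectors are lists of four numbers, arrays are lists of lists on one process, and the x-term uses g on both sides of the face, as equation (3) indicates. The function name add_east_face is hypothetical.

```python
def add_east_face(r, f, g, xnode, ynode, i, j):
    """Accumulate the east-face term of equation (3) into r, the residual of cell (i, j)."""
    # The factor 0.5 turns the sum of the two neighbouring cell values into their average.
    hy = 0.5 * (ynode[i][j] - ynode[i][j - 1])
    hx = 0.5 * (xnode[i][j] - xnode[i][j - 1])
    for k in range(4):                              # the four conserved densities
        r[k] += hy * (f[i][j][k] + f[i + 1][j][k])  # f * S_y
        r[k] -= hx * (g[i][j][k] + g[i + 1][j][k])  # g * S_x
```

The reads of f[i + 1][j] and g[i + 1][j] are the shifted accesses that, in the distributed version, rely on the ghost regions filled by writeHalo().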

We will briefly discuss two other interesting complications: handling of
so-called artificial viscosity, and imposition of boundary conditions.

Artificial viscosity (or artificial smoothing) is added to damp out a numerical
instability in the Runge-Kutta time-stepping scheme, which otherwise causes
unphysical oscillatory modes associated with the discretization to grow. An
accepted scheme adds small terms proportional to 2nd and 4th order finite
difference operators to the update of U. From the point of view of HPJava
programming one interesting issue is that 4th order damping implies an update stencil
requiring source elements offset two places from the destination element (unlike
Figure 5, for example, where the maximum offset is one). This is handled by
creating the U array with ghost regions of width 2.
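To see why the ghost width must be 2, consider the standard centred fourth difference in one dimension, sketched below in plain Python. The exact damping operator of the CFD code is not given in the text, so this stencil and the helper pad_with_ghosts are illustrative assumptions only.

```python
def fourth_difference(u, i):
    """Centred 4th order difference at position i; reads u[i-2] .. u[i+2]."""
    return u[i - 2] - 4.0 * u[i - 1] + 6.0 * u[i] - 4.0 * u[i + 1] + u[i + 2]


def pad_with_ghosts(local, left, right, width=2):
    """Attach `width` ghost cells copied from each neighbouring block."""
    return left[-width:] + local + right[:width]
```

With a ghost width of 1 the first and last locally held elements could not be updated, because their stencils reach two sites into the neighbouring blocks.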

Implementing numerically stable boundary conditions for the Euler equations
is non-trivial. In our implementation the domain of cells is rectangular, though
the grid is distorted into an irregular pipe profile by the choice of physical
coordinates attached to grid points (xnode, ynode distributed arrays). HPJava has
an additional control construct called at, which can be used to update edges (it
has other uses). The at statement is a degenerate form of the overall statement.
It only “enumerates” a single location in its specified range. To work along the
line x = 0, for example, one may write code like:

at (i = x [0])
    overall (j = y for 1 : jmax - 1) {
        ... assign U [i, j] in terms of U [i + 1, j], etc ...
    }

The actual code in the body is a fairly complicated interpolation based on
Riemann invariants. In general access to U [i + 1, j] here relies on ghost regions
being up-to-date, exactly as for an index scoped by an overall statement.



5 Benchmark Results 

For the two applications described above, we have sequential and parallel pro- 
grams to compare performance. The sequential programs were written in Java 
and/or Fortran 95. The parallel programs, of course, were written in HPJava. 
For multigrid we also compare with an available HPF code (taken from [4]). 

The experiments were performed on the SP3 installation at Florida State
University. The system environment for the SP3 runs was as follows:

— System: IBM SP3 supercomputing system with AIX 4.3.3 operating system
and 42 nodes.

— CPU: A node has four processors (Power3 375 MHz) and 2 gigabytes of
shared memory.

— Network MPI Settings: Shared “css0” adapter with User Space (US)
communication mode.

— Java VM: IBM's JIT.

— Java Compiler: IBM J2RE 1.3.1.

For best performance, all sequential and parallel Fortran and Java codes were
compiled using the -O5 or -O3 with -qhot or -O (i.e. maximum optimization) flag.

5.1 Multigrid Results 

First we present some results for the computational kernel of the multigrid
code, namely the unaccelerated red-black relaxation algorithm of Figure 1. Figure 6
gives our results for this kernel on a 512 by 512 matrix. The results are
encouraging. The HPJava version scales well, and eventually comes quite close to the
HPF code (absolute megaflop performances are modest, but this feature was
observed for all our codes, and seems to be a property of the hardware)⁶.

The flat lines at the bottom of the graph give the sequential Java and Fortran 
performances, for orientation. We did not use any auto parallelization feature 
here. 



⁶ We do not know why the HPJava result on 25 processors appears to be below the
general trend. However the result was repeatable.

[Figure: Laplace Equation using Red-black Relaxation, 512 x 512; curves for HPF, HPJava, Fortran and Java]

Fig. 6. Red-black relaxation of two dimensional Laplace equation with size of 512²



Corresponding results for the complete multigrid code are given in Figure 7.
The results here are not as good as for simple red-black relaxation: both HPJava
speed relative to HPF, and the parallel speedup of HPF and HPJava are less
satisfactory.

The poor performance of HPJava relative to Fortran in this case can be
attributed largely to the naive nature of the translation scheme used by the
current HPJava system. The overheads are especially significant when there are
many very tight overall constructs (with short bodies). We saw several of these
in section 3. Experiments done elsewhere [13] lead us to believe these overheads
can be reduced by straightforward optimization strategies which, however, are
not yet incorporated in our source-to-source translator⁷.

The modest parallel speedup of both HPJava and HPF is due to
communication overheads. The fact that HPJava and HPF have similar scaling behavior,
while absolute performance of HPJava is lower, suggests the communication
library of HPJava is slower than the communications of the native SP3 HPF
(otherwise the performance gap would close for larger numbers of processors).
This is not too surprising because Adlib is built on top of a portability layer
called mpjdev, which is in turn layered on MPI. We assume the SP3 HPF is
more carefully optimized for the hardware. Of course the lower layers of Adlib
could be ported to exploit low-level features of the hardware (we already did
some experiments in this direction, interfacing Java to LAPI [14]).

⁷ There are also likely to be inherent penalties in using a JVM vs an optimizing
Fortran compiler, but other experiments suggest these overheads should be smaller
than what we see here. The communication overheads are probably aggravated by
a choice we made in the data distribution format in these experiments. All levels are
distributed blockwise. A better choice may be to distribute only the finest levels,
and keep the coarser levels sequential. This doesn't require any change to the main
code, only to initialization of the grid stack. However this wasn't what was done in
these experiments.

[Figure: Multigrid Solver, 512 x 512; curves for HPF and HPJava]

Fig. 7. Multigrid solver with size of 512²

5.2 CFD Results 

Figure 8 gives some performance results for a version of the CFD code. The
speedup results are quite reasonable, even for small problem sizes. Presumably 
this reflects the intrinsically greater granularity of this problem, compared with 
the multigrid case. (In this case unfortunately we don’t have a Fortran version 
to compare with.) 

6 Discussion 

We illustrated, by a detailed discussion of the coding of two parallel applica- 
tions, that the parallel primitives introduced in HPJava are a good match to 
the requirements of various applications. The limitations imposed on distributed 
control constructs like overall, and especially the strict rules for subscripting dis- 
tributed arrays, may look strange from a language design perspective. But these 
features are motivated by patterns observed in practical parallel programs. 

In particular the language provides a good framework for the development
of SPMD libraries operating on distributed arrays. The collective operations
of high-level libraries like Adlib, operating directly on distributed arrays,
abstract and generalize the popular collective operations of MPI and other SPMD
libraries. They also follow in the spirit of the array intrinsics and libraries of
Fortran 90/95 and HPF. The language resembles HPF in various ways. But the
programming model is closer to the MPI style. MPI programming seems to have
been more popular in practice than HPF, perhaps because it gives the
programmer control over communication, and it allows the programmer to estimate the
cost of his program by looking at the code. We claim these as advantages for
HPJava, too.

Fig. 8. CFD with size of 256²

In its current stage of development HPJava, like HPF, seems most naturally
suited for problems with some regularity. This is not to say that more irregular 
problems can’t be tackled. But doing so will at least need more specialized 
communication library support. 

We have also shown that the performance of the initial implementation of
HPJava is quite promising⁸. The current implementation provides full functionality,
but it has not been seriously optimized. There is scope for dramatic
improvements in efficiency [13].

Abstract. This paper introduces a new primitive data type, hierarchically
tiled arrays (HTAs), which could be incorporated into conventional
languages to facilitate parallel programming and programming for
locality. It is argued that HTAs enable a natural representation for many
algorithms with a high degree of locality. Also, the paper shows that,
with HTAs, parallel computations and the associated communication
operations can be expressed as array operations within single threaded
programs. This, it is then argued, facilitates reasoning about the resulting
programs and stimulates the development of code that is highly readable
and easy to modify. The new data type is illustrated using examples
written in an extended version of MATLAB.



1 Introduction 

This paper introduces a new primitive data type which could be incorporated
into conventional languages to facilitate parallel programming and programming
for locality. This new data type facilitates the representation and manipulation of
arrays that are organized as a hierarchy of tiles. These hierarchically tiled arrays
(HTAs) are a generalization of the recursively blocked arrays arising in some
linear algebra algorithms with a high degree of locality. Our proposal is to use
HTAs to facilitate the expression of both locality and parallelism. In a nutshell,
our idea is to distribute the outermost tiles of a hierarchically tiled array for
parallelism, and use the inner tiles for locality and message aggregation. In the
case of sequential programs all tile levels will be used for locality.

* This work is supported in part by the Defense Advanced Research Project Agency
under contract NBCH30390004. This work is not necessarily representative of the
positions or policies of the Army or Government.

The two main sources of inspiration for this project were the extensive body
of work on blocked linear algebra algorithms [6, 2] and two recently proposed
languages, Co-Array Fortran [12] and Unified Parallel C (UPC) [3]. Our proposal
follows these two languages in that it represents communication explicitly as
array assignments. The use of array assignments to represent communication
has at least two advantages over the library-based approach of MPI [5]. First,
thanks to APL [8] and Fortran 90 we have at our disposal a wealth of powerful
array operators that can serve to unify and simplify the many communication
and collective operations of MPI. Second, making the operations part of the
language enables compiler support that simplifies the notation and improves
error detection.

We, however, do not follow Co-Array Fortran and UPC in the use of the
SPMD programming paradigm. Instead, our proposal resembles the
programming model of the old SIMD machines, but instead of limiting the parallelism to
simple arithmetic or logic array operations, we take advantage of the MIMD
nature of today's parallel machines and allow in the expression of parallelism the use
of complex array operations represented as user-defined functions. Abandoning
the SPMD model has the drawback of removing some control on the parallelism
from the programmer, but the single thread programming model has the great
advantage of enforcing structure and leading to programs that are more readable
and easier to develop and maintain. Furthermore, we expect that much of the
potential loss of performance can be avoided with relatively simple compiler and
run-time techniques.

Our approach differs from that of High Performance Fortran [7, 9] in
that it makes all communication and array distribution explicit and therefore
it requires much less from the compiler than High Performance Fortran.
Although making communication explicit complicates programming, there is no
better alternative at this time given the failure of High Performance Fortran.
Furthermore, languages for parallel programming with explicit communication
will always be necessary much in the same way that assembly language
programming is still necessary today for conventional programming. The availability of
a lower level language is useful as a fall back position whenever the compiler fails
to do the right thing and as a means to experiment with alternative solutions
that can later be incorporated into a compiler.

Hierarchically tiled arrays can be easily incorporated into several
programming languages including Fortran 90, APL, and MATLAB. In this paper we
focus on extending MATLAB with hierarchically tiled arrays for two main
reasons. The first is that an extended MATLAB system would make a great tool for
prototyping parallel programs. Such a tool is sorely needed and although many
MATLAB programmers may not be interested in parallelism, we believe that
many parallel programmers would be interested in a good prototyping tool. The
second reason is that MATLAB has many features that make it a convenient
platform for a first implementation of our ideas. In the rest of this paper, we
describe hierarchically tiled arrays (Section 2), present a mechanism for their
representation in memory (Section 3) and then illustrate their use in programming
for locality (Section 4) and parallelism (Section 5).

Fig. 1. Two tiled arrays

Fig. 2. A partitioned array

2 Hierarchically Tiled Arrays 

In this section we define hierarchically tiled arrays (Section 2.1), and discuss 
how to build them (Section 2.2), access their components (Section 2.3), and how 
they can be used in expressions and values assigned to them (Section 2.4). 

2.1 Definition of Hierarchically Tiled Array 

We define a tiled array as an array that is partitioned into subarrays in such
a way that adjacent subarrays have the same size along the dimension of
adjacency. Although the literature usually assumes that array tiles have the same
shape (Fig. 1(a)), we do not require this in our definition because there are
important cases where using tiles of different sizes (Fig. 1(b)) is advantageous.
Notice that our definition implies that m-dimensional arrays are partitioned by
(m - 1)-dimensional hyperplanes that are perpendicular to one of the
dimensions. Furthermore, “randomly” partitioned arrays such as that shown in Fig. 2
do not fall under our definition of tiled arrays.

We define hierarchically tiled arrays (HTAs) as tiled arrays where each 
tile is either an unpartitioned array or a hierarchically tiled array. Although this 
definition allows different tiles to be partitioned in different ways, most often 
HTAs will be homogeneous, that is adjacent submatrices at each level will not 
only have the same size as their neighbors along the dimension of adjacency, 
but they will also agree in the number and position of the partitions along that 
dimension. 

A two-level hierarchy where neighboring tiles are partitioned differently, and
which therefore depicts a non-homogeneous HTA, is shown in Fig. 3(a). In this figure,
the outer tiles are separated by the dashed lines and the inner tiles by the
dotted lines. There are three mismatches in Fig. 3(a). One is between outermost
tile {1,2}, which is not partitioned at all, and tile {1,1}, which is partitioned
into two parts along the vertical dimension, which is the dimension of adjacency
between these two tiles. The other two mismatches are between outermost tiles
{1,1} and {2,1} and between tiles {2,1} and {2,2}. Fig. 3(b) is an example
of a homogeneous HTA where the number of tiles and the sizes of all tiles match
along the dimensions of adjacency.

Fig. 3. Two level tiled arrays

Fig. 4. Bottom up tiling

2.2 Construction of HTAs 

A simple way to obtain homogeneous HTAs is to tile the matrix at the lowest 
level of the hierarchy first and then proceed recursively by tiling the resulting 
array of tiles. This bottom-up process, illustrated in Fig. 4, always generates 
homogeneous HTAs. 

We can alternatively start from the top and successively refine each partition. 
The top down approach is more flexible than the bottom up approach in that it 
enables the generation of both homogeneous and nonhomogeneous HTAs. 

In an interactive array language such as MATLAB, HTAs can be built
following either approach if the appropriate functions are available. For the bottom
up approach we define the function tile that accepts as parameters an
m-dimensional HTA or unpartitioned array and m vectors, p_1, p_2, ..., p_m (one for
each dimension of the HTA), and returns an HTA partitioned by the hyperplanes
defined by p_i(k), 1 ≤ i ≤ m, 1 ≤ k ≤ size(p_i). These partition dimension i of
the array right after element p_i(k). For example, given a 10 x 12 matrix D, the
statements

    C = tile(D, [2,4,6,8], [3,6,9]);
    B = tile(C, [3], [1,2,3]);                                    (2.1)
    A = tile(B, [1], [1]);

will generate the three HTAs shown in Fig. 4.

For the top down approach we define the function hta which accepts m
natural numbers as parameters, d_1, d_2, ..., d_m, and returns a d_1 x ... x d_m array whose
elements are empty tiles that can hold HTAs or unpartitioned arrays. Before
presenting an example of top down creation of HTAs, we need to describe how
to address the tiles in an HTA. The outermost tiles of an HTA can be addressed
using subscripts enclosed by curly brackets. An additional set of subscripts should
be added for each level of the HTA that needs to be addressed. Thus, the tile
containing element E(5,4) if E is partitioned as shown in Fig. 1(a) would be
accessed as E{3,2}. Also, the inner tile containing element F(5,4) in an array
F with the shape shown in Fig. 3(b) would be addressed as F{2,1}{1,2}.

We can now illustrate the top down creation of HTAs. The top two levels of
an array G with the shape shown in Fig. 3(a) could be created as follows:

    G = hta(2,2);
    G{1,1} = hta(2,2);                                            (2.2)
    G{2,1} = hta(2,3);
    G{2,2} = hta(3);

and the elements of the upper left quadrant could be filled with two-dimensional
arrays of normally distributed random numbers as follows:

    G{1,1}{1,1} = randn(2,3);    G{1,1}{1,2} = randn(2,6);        (2.3)
    G{1,1}{2,1} = randn(2,3);    G{1,1}{2,2} = randn(2,6);

A drawback of the bottom up approach as illustrated in (2.1) is that it creates
intermediate HTAs which are in most cases unnecessary. A reasonable compiler
could have these temporary HTAs deleted after their only use in the creation
sequence or could avoid their creation altogether by, for example, reversing the
creation process into a top down form. As can be seen in the foregoing example,
the top down approach does not suffer from this problem.
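The bottom up construction in (2.1) can be mimicked in plain Python for the two-dimensional case. The sketch below is an illustration only and not the MATLAB extension itself: it represents both matrices and arrays of tiles as lists of lists, so the same hypothetical tile function can be applied again to its own result.

```python
def tile(array, row_cuts, col_cuts):
    """Partition a 2-D list right after the given 1-based row and column positions."""
    n_rows, n_cols = len(array), len(array[0])
    row_edges = [0] + list(row_cuts) + [n_rows]
    col_edges = [0] + list(col_cuts) + [n_cols]
    return [[[row[c0:c1] for row in array[r0:r1]]
             for c0, c1 in zip(col_edges, col_edges[1:])]
            for r0, r1 in zip(row_edges, row_edges[1:])]


# A 10 x 12 matrix whose entries record their own 1-based coordinates.
D = [[(r + 1, c + 1) for c in range(12)] for r in range(10)]

C = tile(D, [2, 4, 6, 8], [3, 6, 9])   # 5 x 4 array of 2 x 3 tiles
B = tile(C, [3], [1, 2, 3])            # 2 x 4 array of tiles of C
A = tile(B, [1], [1])                  # 2 x 2 array of tiles of B
```

As the text notes, C and B exist here only as intermediate values on the way to A.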

2.3 Addressing the Scalar Elements of an HTA 

We discuss next how to address the scalar elements of an HTA. The simplest
way to address an element is to ignore all tiling and address the elements using
conventional subscripting. For example, element (4,5) of an array, H, that has
been tiled as shown in Fig. 1(b) can be addressed as H(4,5). To use tiling for
addressing a scalar element, we can use the curly bracket notation introduced
above followed by conventional subscripts enclosed within parentheses. The
conventional subscripts specify the location of the element within the innermost tile
in the hierarchy. Thus, element H(4,5) can also be addressed as H{2,2}(2,3).
We call flattening the mechanism that allows addressing an array ignoring the
tile structure. Thus, we say that flattening enables the use of H(4,5) to access
element (4,5) of array H. Flattening can also be applied at an intermediate level
of the hierarchy. For example element (5,7) of an array A tiled as shown in Fig. 4
could be referenced as A(5,7), or as A{1,2}{2}{3}(1,1) if all the levels of the
tiling hierarchy are taken into account. We could also flatten the last level of
the hierarchy and address the same element as A{1,2}{2}(5,1) or flatten the
second level to get A{1,2}(5,4).
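The translation between a flattened subscript and a tiled address along one dimension amounts to counting the partition points that precede the element. A small Python sketch, with a hypothetical helper locate and with the cut positions of the example in (2.1) supplied by hand, might look like this:

```python
from bisect import bisect_left


def locate(index, cuts):
    """Map a 1-based element index to (tile number, index within tile), both 1-based.

    `cuts` lists the positions right after which the dimension is partitioned.
    """
    t = bisect_left(cuts, index)            # number of cuts strictly before the element
    start = cuts[t - 1] if t > 0 else 0     # last element of the preceding tile
    return t + 1, index - start
```

For element (5,7) of the 10 x 12 example, the outermost tiles of A are cut after row 6 and after column 3, so locate(5, [6]) and locate(7, [3]) give tile {1,2} and local subscripts (5,4), in agreement with A{1,2}(5,4).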

2.4 Assignments and Expressions Involving HTAs 

The last topic to be discussed in this section is the meaning of assignment state- 
ments and expressions involving HTAs. Our objective is to generalize the notion 
of conformable arrays of Fortran 90 and the semantics of assignments to unde- 
fined variables of MATLAB. Let us first present four definitions that we will 
need in this section. 

— We call leaf elements the elements of an HTA that do not have any compo- 
nents. The leaf elements could be empty containers or arrays of scalars. In 
the spirit of MATLAB, scalars cannot appear in isolation within HTAs and 
will be represented as 1 x 1 arrays. 

— We say that an HTA is complete when all of its leaf elements are arrays of 
scalars. Otherwise, when some of the leaf elements are empty containers, the 
HTA is said to be incomplete. For example, arrays A, B, and C in the sequence 
(2.1) are complete. On the other hand, array G right after the sequence (2.2)
is incomplete and will remain incomplete after the statements in (2.3) are
executed, because these statements do not fill the containers G{1,2}, G{2,1}
and G{2,2}.

— In Fortran 90, two arrays with the same shape (that is, the same number 
of dimensions and the same size in each dimension) are conformable. Also, 
scalars (and 1 x 1 arrays in our case) are conformable to arrays of any
shape. Scalar binary operations such as add and multiply are extended in
Fortran 90 to work on conformable objects. When both operands are arrays
with the same shape, the operation is performed on corresponding pairs of
scalars. When one of the objects is a scalar and the other an array, the
scalar is operated with each of the elements of the array. Thus,
c(1:10, 1:20:3) + d(1:10, 1:7) is a valid Fortran 90 operation since the operands are
both 10 x 7 arrays. Here, corresponding elements of the operands are added
to each other to produce an array that is conformable to the operands. The
expression e(:,:) + 5 is also valid and will add the scalar 5 to each element
of array e producing an array with the shape of e.

— Two complete HTAs have the same topology if their outermost array of tiles 
have the same shape and corresponding outermost tiles are HTAs with the 
same topology or contain arrays of scalars that are conformable. This means 
that two HTAs will have the same topology if the only difference between 
them is on the leaves where the arrays have to be conformable, but do not 
have to have identical shapes. 

We now proceed by discussing conformability, expressions, and assignment 
operations. 



168 Gheorghe Almasi et al. 




Fig. 5. Operating on a section of an HTA 



Conformability. Two complete HTAs are conformable if they have the same 
topology or one of them is conformable to all elements at the top level of the 
other. Notice that the second part of the definition is recursive. That is, if the
smaller HTA does not have the same topology as one of the top level elements,
then it must have the same topology as all components of this top level element,
and so on. Informally, this definition means that two HTAs of different sizes
will be conformable if the smaller one has the same topology as all elements of
the other that are a certain level above the leaf elements. Notice also that our
definition implies that a scalar is conformable to any complete HTA. 
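
The definitions of topology and conformability can be made concrete with a small model. The following is only a sketch in Python, not the authors' implementation: the names Leaf, HTA, grid, same_topology, fits_into, and conformable are hypothetical, only shapes are stored, and the example arrays are invented for illustration rather than taken from sequence (2.1).

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Leaf:
    """A leaf array of scalars; only its shape matters for this check."""
    rows: int
    cols: int


@dataclass
class HTA:
    """One tiling level: a rectangular grid of tiles (sub-HTAs or leaves)."""
    tiles: List[List[Union["HTA", Leaf]]]


def grid(rows, cols, make_tile):
    """Build an HTA level with rows x cols tiles produced by make_tile()."""
    return HTA([[make_tile() for _ in range(cols)] for _ in range(rows)])


def leaves_conformable(a, b):
    """Fortran 90 rule: equal shapes, or one operand is a 1 x 1 array."""
    sa, sb = (a.rows, a.cols), (b.rows, b.cols)
    return sa == sb or sa == (1, 1) or sb == (1, 1)


def same_topology(a, b):
    """Same tile-grid shape at every level and conformable leaves."""
    if isinstance(a, Leaf) and isinstance(b, Leaf):
        return leaves_conformable(a, b)
    if isinstance(a, HTA) and isinstance(b, HTA):
        same_shape = (len(a.tiles) == len(b.tiles)
                      and len(a.tiles[0]) == len(b.tiles[0]))
        return same_shape and all(
            same_topology(x, y)
            for row_a, row_b in zip(a.tiles, b.tiles)
            for x, y in zip(row_a, row_b))
    return False


def fits_into(small, big):
    """True if small matches every top-level element of big, recursively."""
    if isinstance(big, Leaf):
        return False
    return all(same_topology(small, t) or fits_into(small, t)
               for row in big.tiles for t in row)


def conformable(a, b):
    return same_topology(a, b) or fits_into(a, b) or fits_into(b, a)


# Hypothetical example: A has two tiling levels, B has one, M is a plain array.
def inner():
    return grid(1, 3, lambda: Leaf(2, 3))


A = grid(2, 2, inner)
B = inner()
M = Leaf(2, 3)
```

In this model B is conformable to A because B has the same topology as every top-level tile of A, and M is conformable to A because it matches every leaf two levels down. A 3 x 3 array would not be conformable to A.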



Expressions involving HTAs. Following Fortran 90 the meaning of scalar 
operations is extended so that when the operands are both HTAs with the same 
topology the operation will be performed between corresponding scalar elements 
and will return an HTA with the topology of the operands. When the operands 
have different topologies, the smaller one is operated with all matching objects 
at the bottom of the hierarchy of the larger one. For example, adding a 2 x 3
array M to HTA A resulting from sequence (2.1) is a valid operation that would
result in M being added to all 2 x 3 arrays of scalars that are at the bottom of the
hierarchy of A. Similarly, A + 3 is valid and will add 3 to each scalar element in
A. Notice that flattening changes the topology of an HTA. Thus, while the term
B by itself represents the HTA computed in sequence (2.1) and therefore has two
levels of tiling, the term B{ ) represents an HTA with a single level of
tiling.

It is also possible to operate on a section of an HTA. Thus, B{1,:}{2:3}+1
will operate on only one section of the HTA and will return an HTA with the 
shape of the section as illustrated in Fig. 5. 

Also following Fortran 90, scalar intrinsic functions are extended to operate 
on complete HTAs. These functions will operate on each scalar separately and 
will return an HTA with the topology of the operand. For example, sin(A) will
return an HTA with the topology of A, but with each scalar replaced by its sine.
Similarly, intrinsic array operations involving a single array will be extended in
the natural way. For example, max(A) will return an HTA that will have the
topology of A, except that every array of scalars will be replaced by a single 
scalar (which is a 1 x 1 array, as stated above) that contains the maximum value 
of the array it replaces. 
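
The extended semantics of scalar operations and intrinsics can be sketched as a recursive walk over the tiling hierarchy. This is a minimal model in Python under stated assumptions, not the authors' implementation: Tiles, hta_op, hta_map, and hta_max are hypothetical names, each tiling level is stored as a flat list of tiles, leaves are lists of rows, and a scalar is written as a 1 x 1 leaf such as [[3]].

```python
import math
import operator


class Tiles(list):
    """A tiled level of an HTA, stored as a flat list of tiles."""


def depth(h):
    """Number of tiling levels above the leaf arrays."""
    return 1 + depth(h[0]) if isinstance(h, Tiles) else 0


def leaf_op(op, x, y):
    """Apply op to two leaf arrays (lists of rows), broadcasting 1 x 1 arrays."""
    if len(x) == 1 and len(x[0]) == 1:
        return [[op(x[0][0], v) for v in row] for row in y]
    if len(y) == 1 and len(y[0]) == 1:
        return [[op(v, y[0][0]) for v in row] for row in x]
    if len(x) != len(y) or len(x[0]) != len(y[0]):
        raise ValueError("leaf arrays are not conformable")
    return [[op(p, q) for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def hta_op(op, a, b):
    """Binary scalar operation extended to HTAs."""
    da, db = depth(a), depth(b)
    if da == 0 and db == 0:
        return leaf_op(op, a, b)
    if da == db:
        if len(a) != len(b):
            raise ValueError("tile counts differ")
        return Tiles(hta_op(op, x, y) for x, y in zip(a, b))
    if da > db:
        # b is the smaller operand: apply it to every tile of a.
        return Tiles(hta_op(op, x, b) for x in a)
    return Tiles(hta_op(op, a, y) for y in b)


def hta_map(f, a):
    """Scalar intrinsic (for example sin) applied to every scalar."""
    if isinstance(a, Tiles):
        return Tiles(hta_map(f, t) for t in a)
    return [[f(v) for v in row] for row in a]


def hta_max(a):
    """Replace every leaf array by a 1 x 1 array holding its maximum."""
    if isinstance(a, Tiles):
        return Tiles(hta_max(t) for t in a)
    return [[max(max(row) for row in a)]]


# Hypothetical two-level HTA with two 2 x 3 leaves, and a 2 x 3 array M.
A = Tiles([Tiles([[[1, 2, 3], [4, 5, 6]], [[0, 0, 0], [0, 0, 0]]])])
M = [[1, 1, 1], [2, 2, 2]]
plus_m = hta_op(operator.add, A, M)
plus_3 = hta_op(operator.add, A, [[3]])
sines = hta_map(math.sin, A)
maxima = hta_max(A)
```

Note that hta_max follows the description in the text (one scalar per leaf array), which differs from the column-wise result that MATLAB's max returns for an ordinary matrix.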

Assignments. Next, we generalize the semantics of assignment operations. In
MATLAB, when the name of an array X appears in an expression, it refers to
the whole array, but on the left hand side of an assignment statement X refers to
the variable name as a container. Thus, in MATLAB if X is the one-dimensional
array [1 2], the expressions X+1 and X(1:2)+1 have the same meaning, adding
one to each element. On the other hand, while X(1:2)=3 will change X into [3
3], X=3 will change X into the scalar 3. We extend this semantics to HTAs by
assuming that references to containers that appear in expressions represent their 
content while on the left hand side of an assignment statement they represent 
the containers themselves. Thus, B=5, where B is the HTA constructed in (2.1) 
will replace B with the scalar 5. However, B{ , :}=5 will replace each of 
the 2x3 arrays inside B by a 1 x 1 array containing 5 and B{ )=0 
will replace each of the 2x3 arrays inside B with a 2 x 3 array of zeros. 

3 Mapping HTAs onto Memory 

To specify how an HTA is to be mapped onto the memory of a machine, we could 
add a parameter to the functions for building HTAs introduced in the foregoing 
section or create a function variant for each type of mapping. We will follow the 
second approach in this paper. 

We consider two classes of mappings. First, we discuss the mapping onto the 
main memory of a conventional uniprocessor or a shared memory multiproces- 
sor. This mapping associates a unique memory location to each subscript value. 
Most programming languages assume a linear mapping, whose main advantage 
is that computation of the memory location is simple and successive elements 
of an array along any dimension can be computed by addition without the need 
for multiplication. Compilers take advantage of this property via the strength 
reduction optimization. 

Linear mapping can be done at any particular level of an HTA by laying 
out the tiles at this level in consecutive memory locations following a row major 
order or a column major order. We will assume that the functions tile and hta 
allocate objects in a row major order. To obtain column major order we would 
have to define new functions such as tileColumMajor or htaColumMajor. Other 
layout functions that are advantageous for some classes of algorithms such as C- 
order, U-order, Hilbert order, and Z or Morton order can be attained similarly by 
creating the appropriate functions (e.g. tileCOrder) and extending the compiler 
to generate the corresponding address expressions. 
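
The address computations behind these layouts can be illustrated as follows. This is a sketch in Python with 0-based indices; the function names are hypothetical and are not part of the proposed MATLAB extension. It shows a row major offset, a walk along a column that uses only additions (the effect of strength reduction), and a Z or Morton index obtained by interleaving the bits of the two subscripts.

```python
def row_major_offset(i, j, ncols):
    """Offset of element (i, j) in a row major layout (0-based)."""
    return i * ncols + j


def walk_column(j, nrows, ncols):
    """Offsets down column j using only additions (strength reduction)."""
    offsets, off = [], j
    for _ in range(nrows):
        offsets.append(off)
        off += ncols
    return offsets


def morton_index(i, j, bits=16):
    """Z (Morton) order: interleave the bits of i and j, with j in the even bits."""
    z = 0
    for b in range(bits):
        z |= ((j >> b) & 1) << (2 * b)
        z |= ((i >> b) & 1) << (2 * b + 1)
    return z


# Tiles of a 2 x 2 grid listed in row major order and their Z-order positions.
z_positions = [morton_index(i, j) for i in range(2) for j in range(2)]
```

With the bit assignment assumed here, a 2 x 2 grid of tiles has the same row major and Z-order positions; the two layouts start to differ on larger grids, where Z order keeps each 2 x 2 group of tiles contiguous.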

Next, we discuss mapping onto the different nodes of a multicomputer or 
distributed-memory multiprocessor. To this end, we will assume that the nodes of 
the target machine form an n-dimensional mesh. The mesh (virtual) organization 
is by far the most frequently assumed topology in parallel programming. In 
our extension to MATLAB, we will use descriptors of node arrangements that are
created by the nodes function and can be assigned to a variable. The invocation
nodes(d1, d2, ..., dn) returns a descriptor of a d1 x d2 x ... x dn mesh of nodes.

For parallel programming, the top level of an HTA can be distributed across
the nodes of a distributed memory machine using functions htaD and tileD. In 
the simplest case, when the top level of that HTA has the same shape as the 
node mesh, the meaning of the distribution operation is the obvious one: Each 
tile at the topmost level of the HTA is allocated to a different node. For example, 
to distribute HTA A created in (2.1) across a 2 x 2 node mesh we could modify 
the sequence as follows: 

   P = nodes(2,2);
   C = tile(D, [2,4,6,8], [3,6,9]);
   B = tile(C, [3], [1,2,3]);
   A = tileD(P, B, [1], [1]);

Here, tile A{i,j} is allocated to node (i,j), 1 ≤ i ≤ 2, 1 ≤ j ≤ 2. Similarly,
to distribute the top level of HTA G created in (2.2), one tile per node, we could
modify sequence (2.2) as shown next. Here, again, tile G{i,j} is allocated to node
(i,j), 1 ≤ i ≤ 2, 1 ≤ j ≤ 2:

   P = nodes(2,2);
   G = htaD(P, 2, 2);
   G{1,2} = hta(2,2);
   G{2,1} = hta(2,3);
   G{2,2} = hta(3);

If the top level of the HTA has the same number of dimensions, but fewer 
components than the mesh of nodes where it is to be distributed, allocation will 
take place on consecutive processors starting at node 1 on each dimension. If 
the top level of the HTA has fewer dimensions than the mesh, then the top level 
HTA is extended with additional dimensions of size one to match the number 
of dimensions of the mesh. It is invalid for the top level of the HTA to have 
more dimensions than the mesh where it is to be distributed. If the top level 
of the HTA has more elements along one of the dimensions than the number of 
processors along that dimension, we assume a cyclical distribution. 
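
The distribution rules of this section can be summarized in a short function. The following is a sketch in Python, assuming 1-based tile and node indices as in the text; the name owner is hypothetical and is not part of the proposed extension.

```python
def owner(tile_index, mesh_shape):
    """Node that holds a top-level tile, given the shape of the node mesh.

    Fewer tile dimensions than mesh dimensions: pad with dimensions of size one.
    More tiles than nodes along a dimension: wrap around cyclically.
    More tile dimensions than mesh dimensions: invalid.
    """
    if len(tile_index) > len(mesh_shape):
        raise ValueError("HTA has more dimensions than the node mesh")
    padded = tuple(tile_index) + (1,) * (len(mesh_shape) - len(tile_index))
    return tuple((t - 1) % d + 1 for t, d in zip(padded, mesh_shape))


# A 3 x 2 top level on a 2 x 2 mesh: the third row of tiles wraps to node row 1.
placement = {(i, j): owner((i, j), (2, 2))
             for i in range(1, 4) for j in range(1, 3)}
```

When the top level has the same shape as the mesh, owner is the identity, which is the one-tile-per-node case described above.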

4 Programming for Locality with HTAs 

Following the pioneering work of McKellar and Coffman [10], linear algebra com- 
putations are today usually organized to access arrays one tile at a time [2, 6, 13]. 
The same approach has been studied as a compiler optimization technique where 
loops are automatically restructured so that arrays are accessed by tiles rather 
than in the more natural but less efficient row or column order [1, 14, 11]. Al- 
though in some cases these algorithms require that the arrays to be manipulated 
be stored by tiles, in many cases this is not necessary and the reorganization of 
the computation usually suffices to significantly improve the memory hierarchy 
performance of the algorithms. Nevertheless, for large arrays, storage by tiles is 
desirable when the unit of transfer (page or cache line) is large [10] or to avoid 
cache collisions [13]. 



Programming for Locality and Parallelism with Hierarchically Tiled Arrays 171 



The HTA notation of this paper should produce significantly more readable 
code when programming for locality. At the same time, our notation enables the 
layout of arrays in block order which should help performance for large arrays as 
was just mentioned. We now illustrate the benefit of HTA when programming 
for locality using the simple case of matrix multiplication. The typical matrix 
multiplication algorithm with tiling in the three dimensions has the following 
form: 

   for I=1:q:n
      for J=1:q:n
         for K=1:q:n
            for i=I:I+q-1
               for j=J:J+q-1
                  for k=K:K+q-1
                     c(i,j) = c(i,j) + a(i,k) * b(k,j);

Here and in the following examples, we assume that c is initially all zeros. This 
loop is clearly much more complex than the version that does not use tiles, and 
would be even more complex had we not assumed that the size of the matrix, 
n, is a multiple of the block size, q. In contrast, the algorithm implemented on 
a tiled array stored as a single level HTA would have the following form: 

   for I=1:m
      for J=1:m
         for K=1:m
            c{I,J} = c{I,J} + a{I,K} * b{K,J};     (4.1)

This is a much simpler and easier to read form of the same algorithm. One 
reason for the simplicity is the use of the HTA notation. It also helps that in 
MATLAB * stands for the matrix-matrix multiply operator. Notice that, for the
algorithm to work, the tiles need not all have the same size or be square. Clearly,
before code (4.1) executes, a, b and c must be created using functions such as
hta or tile. 
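
To check that the tile-level loop computes the ordinary product even when tiles differ in size, loop (4.1) can be mimicked with nested lists. This is only a sketch in Python, not the HTA implementation: to_tiles, from_tiles, and tiled_matmul are hypothetical helpers, and tile indices are 0-based.

```python
def matmul(x, y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def matadd(x, y):
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def to_tiles(mat, row_sizes, col_sizes):
    """Cut mat into a grid of tiles; tile sizes may differ per row and column."""
    tiles, r0 = [], 0
    for rs in row_sizes:
        tile_row, c0 = [], 0
        for cs in col_sizes:
            tile_row.append([row[c0:c0 + cs] for row in mat[r0:r0 + rs]])
            c0 += cs
        tiles.append(tile_row)
        r0 += rs
    return tiles


def from_tiles(tiles):
    """Reassemble a grid of tiles into one list of rows."""
    out = []
    for tile_row in tiles:
        for r in range(len(tile_row[0])):
            out.append([v for t in tile_row for v in t[r]])
    return out


def tiled_matmul(a, b, c):
    """Counterpart of loop (4.1); c must hold zero tiles of matching sizes."""
    for I in range(len(a)):
        for J in range(len(b[0])):
            for K in range(len(b)):
                c[I][J] = matadd(c[I][J], matmul(a[I][K], b[K][J]))
    return c


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]
Z = [[0] * 3 for _ in range(3)]
a = to_tiles(A, [1, 2], [2, 1])   # column cuts of a ...
b = to_tiles(B, [2, 1], [1, 2])   # ... must equal row cuts of b
c = to_tiles(Z, [1, 2], [1, 2])
result = from_tiles(tiled_matmul(a, b, c))
```

The only requirement in this model is that the column partition of a equals the row partition of b, so that each tile product a[I][K] * b[K][J] is defined.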

Several levels of blocking can be useful in dealing with several levels of the 
memory hierarchy. A simple way to extend (4.1) to handle several levels of blocking
is to replace a{I,K}*b{K,J} with an invocation of a user written function
that uses recursion by calling itself to multiply the tiles of its operands when
they are HTAs, and which stops the recursion when its parameters are arrays of 
scalars. 
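
A recursive multiply of this kind might look as follows. This is a sketch in Python under stated assumptions, not the authors' function: Tiles, tile, flatten, add, and mul are hypothetical names, matrices are square, every level uses square blocks, and both operands have the same tiling depth.

```python
class Tiles(list):
    """A tiled level: a list of rows, each row a list of tiles."""


def tile(mat, sizes):
    """Tile a square matrix recursively; sizes gives the block edge per level."""
    if not sizes:
        return mat
    q, n = sizes[0], len(mat)
    return Tiles(
        [tile([row[c:c + q] for row in mat[r:r + q]], sizes[1:])
         for c in range(0, n, q)]
        for r in range(0, n, q))


def flatten(h):
    """Inverse of tile: rebuild the plain list of rows."""
    if not isinstance(h, Tiles):
        return h
    out = []
    for tile_row in h:
        blocks = [flatten(t) for t in tile_row]
        for r in range(len(blocks[0])):
            out.append([v for blk in blocks for v in blk[r]])
    return out


def add(x, y):
    if isinstance(x, Tiles):
        return Tiles([add(p, q) for p, q in zip(rx, ry)]
                     for rx, ry in zip(x, y))
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def mul(x, y):
    """Recurse while the operands are tiled; stop at arrays of scalars."""
    rows, inner, cols = len(x), len(y), len(y[0])
    if isinstance(x, Tiles):
        out = Tiles()
        for I in range(rows):
            out_row = []
            for J in range(cols):
                acc = mul(x[I][0], y[0][J])
                for K in range(1, inner):
                    acc = add(acc, mul(x[I][K], y[K][J]))
                out_row.append(acc)
            out.append(out_row)
        return out
    # Base case: plain matrix product on arrays of scalars.
    return [[sum(x[i][k] * y[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 0, 1], [0, 1, 1, 0], [2, 0, 1, 0], [0, 3, 0, 1]]
two_level = flatten(mul(tile(A, [2, 1]), tile(B, [2, 1])))
```

Each extra entry in sizes adds one level of blocking, which is how the same function could serve several levels of the memory hierarchy.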

5 Parallel Programming with HTAs 

The only constructs we use to express parallelism are array or collective operations
on distributed HTAs. We assume that the main thread of our parallel programs
will execute on a client sequential machine that could be a workstation. All 
variables, except distributed HTAs, will be assumed to reside in the memory 
of this client. The distributed HTAs on the other hand will be contained in 
the memory of a parallel server. All operations on elements of a distributed
HTA will take place in the server as dictated by a simple version of the
owner-computes rule: the operations that compute values to be stored in an object must
be performed in the node containing the object. We assume that the compiler will 
take care of generating the code that is executed in the server and of inserting 
message-passing primitives so that the needed values are moved to the location 
before they are needed for the computation. 

For example, assume that HTAs A and B have the topology of Fig. 1(a) 
and that their tiles are distributed across a two-dimensional array of nodes or 
processors. Consider then the statement 

   A{:,:} = A{:,:} .* B{:,:};

This statement means that for 1 ≤ i ≤ 5 and 1 ≤ j ≤ 4, tile A{i,j} and tile
B{i,j} should be multiplied element by element (.* is the element-by-element
multiplication operator in MATLAB) and the result should replace tile A{i,j}.
Since the result of multiplying A{i,j} by B{i,j} is to be stored in A{i,j}, the
multiplication must take place in the node containing A{i,j}. Also, these
multiplications can proceed in parallel with each other since the operation appears
in an array statement. Notice that in this case no communication is necessary 
to execute the statement. On the other hand, the statement 

   A{1:4,:} = A{1:4,:} .* B{2:5,:};

requires communication. Therefore, the compiler must generate the appropriate
message-passing primitives so that, for 1 ≤ i ≤ 4 and 1 ≤ j ≤ 4, tile B{i+1,j} is
copied to a temporary in the node containing A{i,j} before the operation can
take place.

Consider finally the statement 

   A{:,:} = A{:,:} .* X;

where X is a variable residing in the client. In this case the compiler will have 
to generate a broadcast operation to send the value of X to all nodes before the 
operation can take place. 
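
The communication implied by the owner-computes rule for the first two statements can be derived mechanically. The following is a sketch in Python of that analysis, not of the compiler: transfers and one_tile_per_node are hypothetical names, and tile {i,j} is assumed to reside on node (i,j).

```python
def transfers(lhs_tiles, rhs_tiles, owner):
    """Pair each left-hand tile with its right-hand operand tile.

    Returns (source node, destination node, tile to copy) for every pair
    whose operand lives on a different node than the tile being assigned.
    """
    moves = []
    for dst, src in zip(lhs_tiles, rhs_tiles):
        if owner(src) != owner(dst):
            moves.append((owner(src), owner(dst), src))
    return moves


def one_tile_per_node(tile):
    """Assumed mapping: tile {i,j} resides on node (i,j)."""
    return tile


rows, cols = 5, 4
all_tiles = [(i, j) for i in range(1, rows + 1) for j in range(1, cols + 1)]

# A{:,:} = A{:,:} .* B{:,:}  -- operands are aligned, nothing to move.
aligned = transfers(all_tiles, all_tiles, one_tile_per_node)

# A{1:4,:} = A{1:4,:} .* B{2:5,:}  -- B{i+1,j} must reach the node of A{i,j}.
lhs = [(i, j) for i in range(1, 5) for j in range(1, cols + 1)]
rhs = [(i + 1, j) for (i, j) in lhs]
shifted = transfers(lhs, rhs, one_tile_per_node)
```

In this model the aligned statement produces no transfers, while the shifted one produces one copy per assigned tile, each from node (i+1,j) to node (i,j).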

Before proceeding with the examples, we need to make an additional ex- 
tension to MATLAB. As mentioned above, in MATLAB when the operands are 
arrays, the * operator represents matrix multiplication and .* represents element
by element multiplication. With the introduction of tiled arrays, we introduce
an additional level in the data hierarchy and the meaning of * and .* must be
extended. We will assume that * between two HTAs with the topology of Fig. 1(a)
will produce the effect of matrix multiplication at the tile level. Thus, we will
assume that c{:,:}=a{:,:}*b{:,:} or simply c=a*b will have the same effect
as loop (4.1). If we just wanted to multiply corresponding submatrices, we would
write c{:,:}=a{:,:}.*b{:,:} or c=a.*b. This will be equivalent to:

   for I=1:m
      for J=1:m
         c{I,J} = a{I,J} * b{I,J};

Notice that in the loop the operands of * are matrices and therefore the operator 
stands for matrix multiplication, the same meaning it has in MATLAB. Finally,
if we just want to multiply corresponding scalars in two-level HTAs, we would
write c=a..*b.

Next, we present two examples of parallel programs using HTAs. The first 
is a dense matrix-matrix multiply and the second is a matrix vector multiply 
where both the matrix and the vector are sparse. 

For our first example, we will implement the SUMMA algorithm. The algo- 
rithm has a very simple representation using HTAs. To explain the algorithm, 
consider first the matrix multiplication loop (4.1) with the innermost loop (loop 
K) moved to the outermost location: 

   for K=1:m
      for I=1:m
         for J=1:m
            c{I,J} = c{I,J} + a{I,K} * b{K,J};

The inner two loops increment the array c{:,:} so that, for 1 ≤ I ≤ m and
1 ≤ J ≤ m, tile c{I,J} is incremented by a{I,K}*b{K,J} on each iteration of the
outermost loop. Notice that for each I, J, and K, the tiles a{I,K} and b{K,J}
are each used in the computation of m different tiles of c. Also, the inner two
loops are a parallel operation on the two-dimensional array of tiles c which
can be easily represented in array form if a and b are appropriately replicated.
The introduction of the replication operations leads directly to the SUMMA
algorithm.

In our notation, we can achieve the replication by extending the MATLAB
repmat function to HTAs. The first parameter of the MATLAB repmat function 
is the matrix to replicate, the second parameter is the number of copies to make 
in the first dimension, the third parameter is the number of copies in the second 
dimension, and so on. Our repmat is an overloaded version that has the same 
semantics as the original one except that it operates on distributed arrays of 
tiles instead of on arrays of scalars. 

   for K=1:m
      t1{:,:} = repmat(a{:,K}, 1, m);
      t2{:,:} = repmat(b{K,:}, m, 1);
      c{:,:} = c{:,:} + t1{:,:} .* t2{:,:};

The repmat function when applied to distributed HTAs could be imple- 
mented in many different ways depending on the characteristics of the target 
machine and the mapping of the source HTA onto the parallel machine. 

The previous loop can be written in a simpler form: 

   for K=1:m
      c{:,:} = c{:,:} + a{:,K} * b{K,:};

This representation leaves the decision of how to implement the broadcasting 
of a and b to the compiler, while in the previous loop the programmer exercises 
some control by choosing the appropriate routine. 
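
The equivalence between the replicated form and the K-outermost tile loop can be checked on a small example. This is a sequential sketch in Python that models only the data movement pattern, not a distributed implementation: summa, tiled_loop, repmat_cols, and repmat_rows are hypothetical names, and tiles are plain lists of rows in an m x m grid.

```python
def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def matadd(x, y):
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def zero_tiles(a, b, m):
    """Grid of zero tiles shaped like the tiles of the product."""
    return [[[[0] * len(b[0][J][0]) for _ in a[I][0]]
             for J in range(m)] for I in range(m)]


def repmat_cols(column, m):
    """Model of repmat(a{:,K}, 1, m): copy a column of tiles m times sideways."""
    return [[column[I]] * m for I in range(m)]


def repmat_rows(row, m):
    """Model of repmat(b{K,:}, m, 1): copy a row of tiles m times downwards."""
    return [list(row) for _ in range(m)]


def summa(a, b, m):
    c = zero_tiles(a, b, m)
    for K in range(m):
        t1 = repmat_cols([a[I][K] for I in range(m)], m)
        t2 = repmat_rows(b[K], m)
        # Tile-by-tile update: every c[I][J] can be computed independently.
        c = [[matadd(c[I][J], matmul(t1[I][J], t2[I][J]))
              for J in range(m)] for I in range(m)]
    return c


def tiled_loop(a, b, m):
    """Reference: the tile loop with K outermost."""
    c = zero_tiles(a, b, m)
    for K in range(m):
        for I in range(m):
            for J in range(m):
                c[I][J] = matadd(c[I][J], matmul(a[I][K], b[K][J]))
    return c


a = [[[[1, 2], [3, 4]], [[0, 1], [1, 0]]],
     [[[2, 0], [0, 2]], [[1, 1], [1, 1]]]]
b = [[[[1, 0], [0, 1]], [[1, 2], [3, 4]]],
     [[[0, 1], [1, 0]], [[2, 2], [2, 2]]]]
c = summa(a, b, 2)
```

After replication, t1[I][J] holds a[I][K] and t2[I][J] holds b[K][J], so the update of each tile of c uses only data that is local to that tile's position.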

The second example will be a matrix vector multiplication where both the 
vector and the matrix are sparse. Coding is significantly simplified by the way
MATLAB handles sparse computations. In fact, sparse matrices are operated on in
MATLAB using the same syntax used for dense computations. The MATLAB
interpreter automatically selects the appropriate procedure to handle sparse data.

Let us assume first that the data is originally in matrix a and vector b, both 
located in the client. Array a will be distributed by blocks of rows across the 
nodes of the target machine. To this end, a is assigned to HTA c that is just a 
distributed linear arrangement of containers. Also, vector b will be distributed 
by blocks of elements using HTA v. There is a vector dista in the client that
specifies which rows of a are to be assigned to c{I}. These are rows dista(I)
to dista(I+1)-1. Similarly, array distb specifies which elements of b will be
assigned to v{I}. The first step of the code would, therefore, contain the following
statements: 

Step 1: P = nodes(n);
        c = htaD(P,n);
        v = htaD(P,n);
        for I=1:n
           c{I} = a(dista(I):dista(I+1)-1, :);
           v{I}(distb(I):distb(I+1)-1) = b(distb(I):distb(I+1)-1);

If matrix a or vector b is too large to fit in the client, the previous loop could
be easily replaced by an I/O function that will read the data directly to the 
components of c and v. 

The matrix vector multiplication will be performed in chunks. In fact, each 
node, I, will compute a chunk of the vector by multiplying c{I} by v. However,
only the elements of the vector corresponding to nonzero columns of c{I} are
needed. If we provide to each node a copy of vector distb, by analyzing c{I}
and correlating the result with distb the node can easily determine, for each
J, which elements of v{J} will be needed to perform the c{I}*v operation. The
result of this analysis will be stored in HTA w. Node I will assign to each w{I}{J}
a vector containing the indices of the elements needed from v{J}. We assume
the existence of a function need that computes w{I}. This function should be
easy to write by a programmer familiar with MATLAB. The second step of our
algorithm is then to call the function need as follows:

Step 2: forall I=1:n
           w{I} = need(c{I}, distb);

Here, we have used the forall construct with the same meaning it has in Fortran 
90. MATLAB does not have such a construct, but we have found it necessary in 
many cases to implement parallel algorithms. 

The next step in the algorithm is to send the data in w{I}{J} (contained in
node I) to node J for all I and J. In this way, node J will know which elements
of its vector block, v{J}, are needed by node I. We will store this information
in HTA x so that x{I}{J} will contain a vector of the indices of the elements
needed by node J from node I. Clearly, x is the transpose of w; therefore, step 3
of the computation will just be: 

Step 3: x = tileTranspose(w);

In the next step each node, I, gathers, for all J, into y{I}{J} the elements
of v{I} needed by node J. Then this data is sent to the appropriate node using
another transpose operation: 

Step 4: forall I=1:n
           for J=1:n
              y{I}{J} = v{I}(x{I}{J}(:));
        z = tileTranspose(y);

Finally, each local vector is extended with the data that just arrived into z 
and the matrix vector multiplication can be performed: 

Step 5: forall I=1:n
           v{I}(w{I}(:)) = z{I}(:);
        v{:} = c{:} * v{:};
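
The five steps can be traced end to end in a sequential model. The following is a sketch in Python under stated assumptions, not the HTA code: sparse_matvec and need are hypothetical implementations, indices are 0-based, node I is modeled by position I of each list, each v[I] is a dictionary keyed by global index, and sparsity is modeled only by skipping zero entries.

```python
def sparse_matvec(a, b, dista, distb):
    """Simulate steps 1 to 5 on n model nodes and return the product a * b."""
    n = len(dista) - 1

    # Step 1: distribute blocks of rows of a and blocks of elements of b.
    c = [a[dista[I]:dista[I + 1]] for I in range(n)]
    v = [{k: b[k] for k in range(distb[I], distb[I + 1])} for I in range(n)]

    # Step 2: w[I][J] lists the indices node I needs from block J of the vector.
    def need(block):
        cols = {k for row in block for k, val in enumerate(row) if val != 0}
        return [sorted(k for k in cols if distb[J] <= k < distb[J + 1])
                for J in range(n)]

    w = [need(c[I]) for I in range(n)]

    # Step 3: transpose, so x[I][J] lists what node J needs from node I.
    x = [[w[J][I] for J in range(n)] for I in range(n)]

    # Step 4: gather the requested values, then transpose to deliver them.
    y = [[[v[I][k] for k in x[I][J]] for J in range(n)] for I in range(n)]
    z = [[y[J][I] for J in range(n)] for I in range(n)]

    # Step 5: extend each local vector and multiply the local block of rows.
    out = []
    for I in range(n):
        for J in range(n):
            v[I].update(zip(w[I][J], z[I][J]))
        out.extend(sum(val * v[I][k] for k, val in enumerate(row) if val != 0)
                   for row in c[I])
    return out


a = [[1, 0, 0, 2],
     [0, 0, 3, 0],
     [0, 4, 0, 0],
     [5, 0, 0, 6]]
b = [1, 2, 3, 4]
product = sparse_matvec(a, b, dista=[0, 2, 4], distb=[0, 2, 4])
```

In this example node 0 needs elements 2 and 3 from node 1, and node 1 needs elements 0 and 1 from node 0, so both transposes move data in each direction.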

6 Conclusions 

The parallel programming approaches that have attracted most attention in 
the recent past fall at the two extremes of the range of possible designs. On 
one hand there is the SPMD, message-passing programming model. MPI is by 
far the most popular implementation of this model, but not the only one. Two
parallel programming languages, Co-Array Fortran and UPC, are other examples
of this model. The incorporation of communication primitives into programming
languages like Co-Array Fortran and UPC significantly reduces the amount of
detail that must be specified for each communication operation in the
library-based approach of MPI. However, in our opinion, Co-Array Fortran and UPC
do not go far enough due to their adoption of the SPMD model which can
easily lead to unstructured code. This lack of structure could be the result of
communication taking place between widely separated sections of code with the
additional complication that a given communication statement could interact
with several different statements during the execution of a program. In theory
at least, the lack of structure possible with SPMD programs could be much
worse than anything possible with the use of goto statements in conventional
programming. In other, perhaps more colorful, words what we are saying is that
the use of the SPMD programming model could lead to four-dimensional spaghetti
code.

The other class of parallel programming models in the spotlight mostly follows
a single-threaded model. Languages in this class include the OpenMP directives [4]
and High Performance Fortran. One difficulty with OpenMP is that it
assumes a shared memory support that is not always available in the hardware
of today's machines. The shared-memory model could be implemented in software,
but that often leads to highly inefficient parallel programs. A second, and
much more serious, limitation of OpenMP is that the directives do not explicitly
represent the notion of locality. This is a very important notion for parallel
programming since distributed memory is a physical necessity in large-scale
multiprocessors. It could be said that the compiler could take care of rearranging
the code and distributing data to take care of locality, but a proven compiler
technology for this purpose is not at hand today. High-Performance Fortran is
based on sequential source code complemented with directives mainly for specifying
how data is to be distributed. It is the task of the compiler to transform the
sequential code into SPMD form and generate all the necessary communication
primitives. Unfortunately, from the poor reception given to HPF it seems that
automatically producing highly efficient code from HPF source is beyond the
capabilities of today's technology.

Our proposal lies somewhere between these two extremes. The programming
model is single-threaded, but communication and distribution are explicit.
Therefore, the requirements from the compiler should be more modest than those of
HPF. Our experience in the programming of both dense and sparse kernels is
that the use of array notation and the incorporation of tiling in a native data
type significantly improve readability when programming for locality and
parallelism. This is clearly due to the importance of tiling for parallel programming
and for locality, a fact that has become increasingly evident in the recent past.

Abstract. Co-array Fortran (CAF) is an emerging model for scalable,
global address space parallel programming that consists of a small
set of extensions to the Fortran 90 programming language. Compared 
to MPI, the widely-used message-passing programming model, CAF’s 
global address space programming model simplifies the development of 
single-program-multiple-data parallel programs by shifting the burden 
for choreographing and optimizing communication from developers to 
compilers. This paper describes an open-source, portable, and retar- 
getable CAF compiler under development at Rice University that is 
well-suited for today’s high-performance clusters. Our compiler trans- 
lates CAF into Fortran 90 plus calls to one-sided communication prim- 
itives. Preliminary experiments comparing CAF and MPI versions of 
several of the NAS parallel benchmarks on an Itanium 2 cluster with 
a Myrinet 2000 interconnect show that our CAF compiler delivers per- 
formance that is roughly equal to or, in many cases, better than that of 
programs parallelized using MPI, even though support for global opti- 
mization of communication has not yet been implemented in our com- 
piler. 



1 Introduction 

Parallel languages and parallelizing compilers have been a long term focus of 
compiler research. To date, this research has not had the widespread impact 
on the development of parallel scientific applications that had been hoped. The 
two standard parallel programming models suited to scientific computation that 
have received industrial backing are OpenMP [1] and High Performance Fortran 
(HPF) [2]. However, both of these models have significant shortcomings that 
reduce their utility for writing portable, scalable, high-performance parallel programs.
OpenMP programmers have little control over data layout; as a result,
OpenMP programs are difficult to map efficiently to distributed memory platforms.
In contrast, HPF enables programmers to explicitly control the mapping
of data to processors; however, to date, commercial HPF compilers have failed to 
deliver high-performance for a broad range of programs. As a result, the Message 
Passing Interface (MPI) [3] has become the de facto standard for parallel pro- 
gramming because it enables application developers to write portable, scalable, 
high-performance parallel programs using very sophisticated parallelizations un- 
der programmer’s control. 

Recently, there has been a significant interest in trying to improve the pro- 
ductivity of parallel programmers by using language-based parallel programming 
models that abstract away most of the complex details of high-performance com- 
munication (e.g. asynchronous calls), yet provide programmers with sufficient 
control to enable them to employ sophisticated parallelizations. Two languages 
in particular have been the focus of recent attention as promising near-term alter- 
natives to MPI: Co-array Fortran (CAF) [4, 5] and Unified Parallel C (UPC) [6]. 
Both CAF and UPC support a global address space model for single-program- 
multiple-data (SPMD) parallel programming. Communication in these languages 
is simpler than MPI: one simply reads and writes shared variables. With commu- 
nication and synchronization as part of the language, these languages are more 
amenable to compiler-directed communication optimization than MPI programs. 

To date, CAF has not appealed to application scientists as a model for devel- 
oping scalable, portable codes, because the language is still somewhat immature 
and a fledgling compiler is only available on Cray platforms [7]. At Rice University,
we are working to create an open-source, portable, retargetable, high-quality
CAF compiler suitable for use with production codes. Our compiler translates
CAF into Fortran 90 plus calls to ARMCI [8], a multi-platform library for one-sided
communication. Recently, we completed implementation of the core CAF
language features, enabling us to begin experimentation to assess the potential 
of CAF as a high-performance programming model. Preliminary experiments 
comparing CAF and MPI versions of the BT, MG, SP and CG NAS parallel 
benchmarks [9] on a large Itanium 2 cluster with a Myrinet 2000 interconnect
show that our CAF compiler prototype already yields code with performance 
that is roughly equal to hand-tuned MPI. 

In the next section, we briefly describe the CAF language and the ARMCI 
library that serves as the communication substrate for our generated code. Sec- 
tion 3 proposes extensions to CAF to enable it to deliver portable high per- 
formance. In Section 4, we outline the implementation strategy of our source- 
to-source CAF compiler. Section 5 presents our recommendations for writing 
high-performance CAF programs. In Section 6, we describe experiments using 
versions of the NAS parallel benchmarks to compare the performance of CAF 
and MPI. Section 7 presents our conclusions and outlines our plans for future 
work.

2 Background

Co-Array Fortran (CAF) supports SPMD parallel programming through a small
set of language extensions to Fortran 90. Like MPI programs, an executing CAF 
program consists of a static collection of asynchronous process images. CAF pro- 
grams explicitly manage data locality and computation distribution; however, 
CAF is a global address space programming model. CAF supports distributed 
data using a natural extension to Fortran 90 syntax. For example, the declaration
integer :: x(n,m)[*] declares a shared co-array with n x m integers local
to each process image. The dimensions inside brackets are called co-dimensions. 
Co-arrays may also be declared for user-defined types as well as primitive types. 
A local section of a co-array may be a singleton instance of a type rather than 
an array of type instances. Instead of explicitly coding message exchanges to 
obtain data belonging to other processes, a CAF program can directly reference 
non-local values using an extension to Fortran 90 syntax for subscripted references.
For instance, process p can read the first column of data in co-array x from
process p+1 with the right-hand side reference to x(:,1)[p+1]. The CAF language
includes several synchronization primitives; the most important of them
are sync_all, which implements a synchronous barrier, sync_team, which is 
used for barrier-style synchronization among dynamically-formed teams of two 
or more processes, and start_critical/end_critical primitives for controlling
entry to a single global critical section. Since both remote data access and syn- 
chronization are language primitives in CAF, communication and synchroniza- 
tion are amenable to compiler-based optimizing transformations. In contrast, 
communication in MPI programs is expressed in a more detailed form, which 
makes it more difficult to improve with a compiler. CAF also contains several 
features that improve the expressiveness and power of the language including dy- 
namic allocation of co-arrays, co-arrays of user-defined types containing pointers, 
and fledgling support for parallel I/O. A more complete description of the CAF 
language can be found elsewhere [5]. 

2.1 ARMCI 

The CAF compiler we describe in this paper uses the Aggregate Remote Mem- 
ory Copy Interface (ARMCI) [8] — a multi-platform library for high-performance 
one-sided (get and put) communication — as its implementation substrate for 
global address space communication. One-sided communication separates data 
movement from synchronization; this can be particularly useful for simplify- 
ing the coding of irregular applications. ARMCI provides both blocking and 
split-phase non-blocking primitives for one-sided communication. On some plat- 
forms, using split-phase primitives enables communication to be overlapped with 
computation. ARMCI provides an excellent implementation substrate for global 
address space languages making use of coarse-grain communication because it 
achieves high performance on a variety of networks (including Myrinet, Quadrics, 
and IBM’s switch fabric for its SP systems) while insulating its clients from 
platform-specific implementation details such as shared memory, threads, and 
DMA engines. A notable feature of ARMCI is its support for non-contiguous
data transfers [10]. 

3 Towards Portable High-Performance CAF 

The CAF programming model is still emerging. Prior to our compiler, the only 
existing CAF compiler implementation was for the Cray T3E and X1 platforms,
tightly-coupled shared memory architectures with high-performance intercon- 
nects that support efficient fine-grain communication and global synchroniza- 
tion. The original CAF language specification [4, 5] was influenced by these 
architectural features, leading to CAF codes that would not perform well on less 
tightly-coupled architectures. Evaluating the performance of CAF codes written 
according to the original language specification on a Myrinet cluster helped us 
to identify several features of the specification that reduce the potential per- 
formance of CAF codes. Below we discuss some of these features along with 
approaches we propose to address the problems they cause. 

Memory fence semantics associated with CAF procedure calls. The 
sync_memory intrinsic is a memory fence that ensures the consistency of a pro- 
cess image’s local memory by waiting for the completion of all of that process’s 
outstanding communication events. To ensure a consistent state for co-array 
data accessed during or after a procedure call, the original CAF model requires 
implicit memory fences before and after every procedure invocation. We found 
this requirement to be overly restrictive since it prevents overlapping commu- 
nication with a procedure call, which is often an important strategy for hiding 
communication latency. It should be possible for a sophisticated programmer 
to relax this requirement where it is unnecessary for correctness. We are in the 
process of exploring design alternatives that will make this possible. 



Overly restrictive synchronization primitives. An issue that arose dur- 
ing our application evaluation was that using synchronization primitives in the 
original CAF language specification reduced the performance of applications we 
studied. For example, the original CAF specification only supports collective 
synchronization (sync_all and sync_team); however, many applications require 
only unidirectional, point-to-point synchronization. Using collective synchroniza- 
tion where only point-to-point synchronization is needed degrades performance 
and in some cases makes programming harder. We propose sync_notify(q) and
sync_wait(p) as two new intrinsics for point-to-point synchronization. When 
a process executes a sync_notify, it initiates notification of the specified process 
image and then can continue immediately. When a process executes a sync_wait, 
it must block until it is notified by the specified process image. When a notifica- 
tion from process p is delivered to process q , all pending communication events 
(both puts and gets) that p issued to q before p initiated the sync_notify have 
completed. 
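
The notify/wait semantics described above can be mimicked with threads to make the blocking behavior concrete. This is a minimal sketch, assuming process images are modeled as Python threads and notifications as counting semaphores; the class NotifyWait and the function run_exchange are hypothetical names, and the model says nothing about how the proposed intrinsics are implemented on top of ARMCI.

```python
import threading

class NotifyWait:
    """Point-to-point notify/wait between numbered process images (toy model)."""

    def __init__(self, num_images):
        # One counting semaphore per ordered (sender, receiver) pair.
        self._sem = {(s, r): threading.Semaphore(0)
                     for s in range(num_images) for r in range(num_images)}

    def sync_notify(self, me, target):
        # Non-blocking: post a notification and continue immediately.
        self._sem[(me, target)].release()

    def sync_wait(self, me, source):
        # Blocking: return only after a notification from `source` arrives.
        self._sem[(source, me)].acquire()

def run_exchange():
    nw = NotifyWait(2)
    mailbox = {}   # stands in for image 1's co-array memory
    result = []

    def image0():
        mailbox["data"] = 42     # models a put to image 1
        nw.sync_notify(0, 1)     # issued after the put, so the data is visible

    def image1():
        nw.sync_wait(1, 0)       # block until image 0 has notified us
        result.append(mailbox["data"])

    threads = [threading.Thread(target=f) for f in (image1, image0)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

Only the two images involved synchronize; no other image would be delayed, which is the advantage over a barrier.
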

Collective operations. The CAF language specification does not provide collective
communication intrinsics. CAF is expressive enough so that users can
write collective communication routines in CAF; however, this is likely to result
in programs tailored to a particular architecture (and in many cases to a range 
of processor counts too) that are unlikely to deliver high performance on archi- 
tectures with different communication latency and bandwidth characteristics. 
CAF should be extended to include collective communication intrinsics to give 
a CAF compiler flexibility to choose an appropriate algorithm and implementa- 
tion suited to the target architecture at hand. We are in the process of designing 
a set of CAF intrinsics for collective communication. 

4 Compiler Implementation Strategy 

We have implemented the core features of CAF, enabling us to express non-trivial 
CAF programs. Section 6 gives a description of some programs we have compiled 
and evaluated. Our compiler performs source-to-source transformation of CAF 
codes into F90 plus calls to a communication library (currently ARMCI). This 
strategy was designed to leverage the best back-end compiler available on the tar- 
get platform to optimize local computation. Our CAF compiler is implemented 
on top of Open64/SL [11], a version of the Open64 compiler infrastructure [12]
that we have modified to support source-to-source transformation of Fortran 90. 
Below we outline some of the principal compiler design issues that arose when 
implementing CAF. 



Memory management issues. Current operating systems do not usually al- 
low for sharing of arbitrary memory allocated independently by different pro- 
cesses. For this reason, memory for co-arrays must be managed by the commu- 
nication substrate separately from memory managed conventionally by an F90 
compiler’s language runtime system. Having the communication library allocate 
co-array memory enables our generated code to use the most efficient communi- 
cation strategy for a particular platform. For example, on an SMP machine the 
memory can be allocated in shared memory which would enable communication 
to be performed using processor load and store instructions. On a Myrinet-based 
cluster, allocating data for a communication event in pinned physical memory 
enables the library to perform data transfers on the memory directly using the 
Myrinet adapter’s DMA engine. 

For CAF programs to perform well, access to the local portions of co-arrays 
must be efficient. Since co-arrays are not supported in F90, we need to translate 
references to the local portion of a co-array into a valid F90 construct and 
this construct must be amenable to back-end compiler optimization. We believe 
that the best strategy is to use an F90 pointer to access local co-array data. 
However, the difficulty with this strategy is that we want to allocate co-array 
data outside F90-managed memory. To use an F90 pointer to access co-array 
data, we must initialize the pointer’s dope vector outside an F90 compiler’s 
language runtime system. This requires compiler-dependent code for initializing
F90 pointers, which poses a minor difficulty when retargeting. 

Co-array sequence association and reshaping. CAF explicitly provides 
sequence association between local parts of co-arrays in common blocks, but 
equivalence of co-array and non-co-array memory is prohibited. To support se- 
quence association, our compiler allocates storage once for each common block 
at program launch and then sets up a procedure-level view for each common 
block containing co-arrays. Our CAF compiler implements this using a two-part 
strategy. First, at compile time, it generates a set of static initializers, which set 
up each procedure’s view of a common block containing co-arrays. Next, at link 
time, a global initialization routine is generated by collecting the static initializ- 
ers. This routine allocates memory once for each common block and invokes the 
static initializer to create each procedure’s view in turn. 

CAF allows programmers to pass co-arrays as arguments to procedures. For 
each formal co-array parameter passed by reference, our implementation aug- 
ments the subroutine prototype with a “hidden” parameter; each hidden pa- 
rameter is a pointer to a runtime data structure describing the co-array to the 
callee. At each call site, every co-array actual parameter is replaced by an F90 
pointer to its local co-array data and a pointer to the run-time data structure 
describing the co-array. 

Co-array communication generation. Communication events expressed with 
CAF’s bracket notation must be converted into F90; however, this is not straight- 
forward because the remote memory is in a different address space. Although the 
language provides shared-memory semantics, the target architecture may not. 
A CAF compiler must provide transformations to bridge this semantic gap. On 
a hardware shared memory platform, the transformation is relatively straightfor- 
ward since references to remote memory in CAF can be expressed as loads and 
stores to shared locations. The situation is more complicated for cluster-based 
systems with distributed memory. To perform the data movement, the compiler 
must generate calls to a communication library since the data resides on a remote 
node. Moreover, storage must be managed to temporarily hold off-processor data 
to perform a computation. 

Naive translation may lead to situations where excessive storage is used and 
superfluous copying is performed. Eventually, our compiler will automatically 
detect such situations and eliminate the extraneous storage and copying, when 
possible. Compare two statements where remote memory is updated: 

a(:)[p] = b(:)

a(:)[p] = b(:) + 1

In the first case, a separate communication buffer may not be necessary since 
the data to be sent to processor p is already available in b. On the other hand, 
the second statement calls for local computation; the result should be computed 
into a temporary communication buffer and then transferred to processor p. Now 
consider the case when remote data is used: 
a(:) = b(:)[p] + 1

a(:) = b(:)[p] + c(:)[q]

In the first statement, no extra communication buffer is necessary because
we can use a for temporarily storing b(:)[p] (a becomes dead) to evaluate the
expression. But for the second case, one extra communication buffer is required 
because we need to transfer two vectors of off-processor data to evaluate the 
expression. 
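
The four cases above follow a simple counting rule, sketched below. This is only an illustration of the reasoning in the text, not the analysis our compiler performs; the function temp_buffers_needed and its parameters are hypothetical, and the rule assumes the destination array is dead and does not alias any operand.

```python
def temp_buffers_needed(kind, remote_operands=0, has_local_computation=False):
    """Count extra communication buffers for one co-array assignment.

    kind: "put" when the left-hand side is remote,
          "get" when right-hand-side operands are remote.
    """
    if kind == "put":
        # a(:)[p] = b(:)      -> send straight from b, no buffer
        # a(:)[p] = b(:) + 1  -> the result must be staged before the transfer
        return 1 if has_local_computation else 0
    if kind == "get":
        # The (dead) destination array can hold one incoming operand;
        # each further remote operand needs its own buffer.
        return max(remote_operands - 1, 0)
    raise ValueError("kind must be 'put' or 'get'")

statements = {
    "a(:)[p] = b(:)":           temp_buffers_needed("put"),
    "a(:)[p] = b(:) + 1":       temp_buffers_needed("put", has_local_computation=True),
    "a(:) = b(:)[p] + 1":       temp_buffers_needed("get", remote_operands=1),
    "a(:) = b(:)[p] + c(:)[q]": temp_buffers_needed("get", remote_operands=2),
}
```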



Key missing features. There are a number of language features that are not 
yet implemented in our preliminary compiler. The most important of these are al- 
locatable co-arrays, co-arrays of user-defined types (including those with pointer 
components), triplet notation in co-dimensions, and multiple co-dimensions. 

5 Writing High Performance Co-array Fortran Code 

Once we completed support for core CAF language features in our prototype 
CAF compiler, we undertook a study of several of the NAS parallel benchmarks 
to understand the interplay of CAF language, compiler, and runtime issues and 
their impact on the programmability, scalability, performance and performance 
portability of applications. From our colleagues Bob Numrich at the University of
Minnesota and Allan Wallcraft at the Naval Research Lab, we received draft CAF
versions of the MG, CG, SP, and BT NAS parallel benchmarks that they cre- 
ated from the MPI codes in the NPB version 2.3 release. Analyzing variants of 
these codes gave us a better understanding of how to develop high performance 
programs in CAF. 

All of the CAF code transformations we describe in this section represent 
manual source-level tuning we applied to CAF sources for the NAS benchmarks 
to best exploit CAF language features for performance. It is our goal to enhance 
the capabilities of our prototype CAF compiler to apply such transformations 
automatically. Our aim is to generate high-performance code that meets or ex- 
ceeds the performance of hand-coded MPI parallelizations from easy to write 
CAF source programs. We are in the process of adding program analysis to our 
compiler to support automating such transformations. 

In our study, we found that there are several key coding strategies for writing 
high performance CAF code. We list them in decreasing order of importance:



Communication aggregation and vectorization. This is a critical opti- 
mization for architectures in which the communication fabric does not support 
low-latency, fine-grain memory transactions. Analysis of the NAS benchmark 
loops revealed that all major communication could be vectorized manually us- 
ing triplet notation for subscripts of co-array references. Once support for data 
flow and dependence analysis are in place in our CAF compiler, in most cases it 
should be straightforward to automate this transformation. Consider Figure 1(a),
a simple code fragment from the conj_grad routine of a first-draft CAF
parallelization of NAS CG that we received from our colleagues. For this code,
our prototype CAF compiler generates a get for every iteration, which is expensive.

    do k = n1, n2
       q(k) = w(j)[reduce_exch_proc(i)]
       j = j + 1
    enddo

    (a) unvectorized

    q(n1:n2) = w(j:j+n2-n1)[reduce_exch_proc(i)]

    (b) vectorized

Fig. 1. NAS CG before and after communication vectorization

Using the triplet notation as in Figure 1(b) enables our present CAF compiler 
prototype to generate a single ARMCI communication event for such a state- 
ment, which is substantially faster than the original. We observed performance 
improvements up to two orders of magnitude by applying this transformation, 
even for relatively small problem sizes of the NAS benchmarks we studied. In 
CAF, when the shapes of source and destination array sections are conformant, 
vectorized communication can be expressed using triplet notation. Otherwise, 
a buffer copy is necessary at the source or destination to yield conformant shapes 
or to pack the data on the sender and unpack the data on the receiver. The latter 
approach mimics the message packing and unpacking in MPI.
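
A simple latency/bandwidth cost model shows why vectorization matters so much on a fabric without low-latency fine-grain transfers. The sketch below is an assumption-laden illustration, not a measurement: the functions and the sample costs (10,000 ns start-up per communication event, 10 ns per element) are hypothetical and chosen only to show the shape of the effect.

```python
def unvectorized_time(n, latency, per_element):
    """One communication event per element: the start-up cost is paid n times."""
    return n * (latency + per_element)

def vectorized_time(n, latency, per_element):
    """One communication event for the whole section: start-up is paid once."""
    return latency + n * per_element

# Hypothetical costs in nanoseconds for a 1000-element array section.
n, latency_ns, per_element_ns = 1000, 10_000, 10
slow = unvectorized_time(n, latency_ns, per_element_ns)
fast = vectorized_time(n, latency_ns, per_element_ns)
speedup = slow / fast
```

Under these assumed costs the per-element version is dominated by start-up latency, so the single vectorized transfer is several hundred times cheaper.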

Synchronization strength reduction. Analogous to the well-known operator
strength reduction transformation, synchronization strength reduction involves
transforming a strong synchronization primitive, e.g., a barrier, into one or
more weaker ones, e.g., point-to-point notify/wait, while preserving the meaning
of the program, with the aim of improving performance. Others have previously
employed similar optimizations with significant benefits [13]. This optimization
was a key performance boost for each of the NAS benchmarks we studied. Figure 2
uses a fragment from NAS MG, a 3D multigrid solver, to illustrate this
transformation. Figure 2(a) shows the original CAF version of the code, in which
each processor performs a barrier synchronization before and after exchanging
boundary layers of its 3D block with its pair of neighbors along a coordinate
dimension. This code was originally written for the Cray T3E, which has fast
hardware support for barriers. However, a barrier provides much stronger
synchronization than necessary; only synchronization with the adjacent neighbors
is needed. On cluster interconnects that do not have fast hardware support for
barriers, it is more efficient to use point-to-point synchronization. Figure 2(b)
shows the code recast to use our new CAF one-way point-to-point synchronization
primitives. On a Myrinet 2000 cluster, this transformation improved performance
by about 30% for 64 processors.

    if( .not. dead(kk) ) then
      do axis = 1, 3
        if( nprocs .ne. 1 ) then
          call sync_all()
          call give3(axis,+1,u,n1,n2,n3,kk)
          call give3(axis,-1,u,n1,n2,n3,kk)
          call sync_all()
          call take3(axis,-1,u,n1,n2,n3)
          call take3(axis,+1,u,n1,n2,n3)
        else
          call comm1p(axis,u,n1,n2,n3,kk)
        endif
      enddo
    else ...

    (a) using barrier synchronization

    if( .not. dead(kk) ) then
      do axis = 1, 3
        if( nprocs .ne. 1 ) then
          call sync_notify(nbr(axis, 1,kk)+1)
          call sync_notify(nbr(axis,-1,kk)+1)
          call sync_wait(nbr(axis, 1,kk)+1)
          call sync_wait(nbr(axis,-1,kk)+1)
          call give3(axis,+1,u,n1,n2,n3,kk)
          call sync_notify(nbr(axis, 1,kk)+1)
          call give3(axis,-1,u,n1,n2,n3,kk)
          call sync_notify(nbr(axis,-1,kk)+1)
          call sync_wait(nbr(axis, 1,kk)+1)
          call sync_wait(nbr(axis,-1,kk)+1)
          call take3(axis,-1,u,n1,n2,n3)
          call take3(axis,+1,u,n1,n2,n3)
        else
          call comm1p(axis,u,n1,n2,n3,kk)
        endif
      enddo
    else ...

    (b) using point-to-point synchronization

Fig. 2. Communication in NAS MG before and after synchronization strength reduction

Conversion of Gets into Puts. On communication fabrics such as Myrinet, 
put operations are supported directly, whereas get operations require asking 
a server-side thread to supply the requested data with a put. For such an inter- 
connect, when using regular algorithms, it is feasible and potentially profitable 
to transform each get operation into a put. 

6 Experiments and Discussion 

In this section we compare the performance of the code our compiler generates 
from CAF with hand-coded MPI implementations of the MG, CG, BT and SP 
NAS parallel benchmark codes. For our study, we used MPI versions from the 
NPB 2.3 release. Sequential performance measurements used as a baseline were 
performed using the NPB 2.3-serial release. The NPB codes are widely regarded 
as useful for evaluating the performance of compilers on parallel systems. 

For each benchmark, we compare the parallel efficiency of the MPI and CAF
implementations. We compute parallel efficiency as follows.
For each parallelization $p$, the efficiency metric is computed as
$\frac{t_s}{P \times t_p(P, p)}$. In this
equation, $t_s$ is the execution time of the original sequential version implemented
by the NAS group at the NASA Ames Research Laboratory; $P$ is the number
of processors; $t_p(P, p)$ is the time for the parallel execution on $P$ processors
using parallelization $p$. Using this metric, perfect speedup would yield efficiency
1.0 for each processor configuration. We use efficiency rather than speedup or 
execution time as our comparison metric because it enables us to accurately 
gauge the relative performance of multiple benchmark implementations across 
the entire range of processor counts. 
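
As a small worked example of the metric, efficiency is the sequential time divided by the product of the processor count and the parallel time. The function name and the sample timings below are hypothetical and are not measurements from our experiments.

```python
def parallel_efficiency(t_serial, num_procs, t_parallel):
    """Efficiency = t_s / (P * t_p(P, p)); 1.0 corresponds to perfect speedup."""
    return t_serial / (num_procs * t_parallel)

# Hypothetical timings in seconds.
perfect = parallel_efficiency(100.0, 4, 25.0)   # 4x speedup on 4 processors
half = parallel_efficiency(100.0, 4, 50.0)      # only 2x speedup on 4 processors
```

Because the metric is normalized by the processor count, curves for different implementations can be compared on one scale across all configurations.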

All experiments were performed on a cluster of 92 HP zx6000 workstations
interconnected with Myrinet 2000. Each workstation node contains two 900MHz
Intel Itanium 2 processors with 32KB/256KB/1.5MB of L1/L2/L3 cache, 4-8GB
of RAM, and the HP zx1 chipset. Our operating environment is the GNU/Linux
operating system (kernel version 2.4.20 plus patches). Although this Linux kernel
is SMP-capable, we used only one of the processors on each SMP node for our
experiments (1) to avoid contention for the Myrinet and local memory, and (2)
to avoid process ping-ponging since our kernel was not configured to support
affinity scheduling. We used Intel Fortran v7.0 for Itanium (efc) as the back-end
compiler for all F90 code generated by the CAF translator as well as for
the MPI versions of the benchmarks. Optimization level 3 was used along with
the override-limits option to prevent the compiler from automatically disabling
certain expensive optimizations. CAF executables were linked against ARMCI
1.1-beta for Myrinet GM. All executables were linked against Myricom’s MPI
implementation MPICH-GM 1.2.5..10 (compiled with Intel’s efc) running on
Myricom’s GM 1.6.4 driver substrate.

    ! notify our partner that we are here and wait for
    ! him to notify us that the data we need is ready
    call sync_notify(reduce_exch_proc(i)+1)
    call sync_wait(reduce_exch_proc(i)+1)

    ! get data from our partner
    q(n1:n2) = w(m1:m1+n2-n1)[reduce_exch_proc(i)]

    ! synchronize again with our partner to
    ! indicate that we have completed our exchange
    ! so that we can safely modify our part of w
    call sync_notify(reduce_exch_proc(i)+1)
    call sync_wait(reduce_exch_proc(i)+1)

    ! local computation
    ... use q, modify w ...

Fig. 3. A typical fragment of optimized CAF for NAS CG

In the following sections, we briefly describe the NAS benchmarks used in our 
evaluation, the key features of their MPI and CAF parallelizations and compare 
the performance of the CAF and MPI implementations. 

6.1 NAS CG 

In the NAS CG parallel benchmark, a conjugate gradient method is used to com- 
pute an approximation to the smallest eigenvalue of a large, sparse, symmetric 
positive definite matrix [9] . This kernel is typical of unstructured grid computa- 
tions in that it tests irregular long distance communication and employs sparse 
matrix vector multiplication. The irregular communication requirement of this 
benchmark is evidently a challenge for all systems. 

On each iteration of loops involving communication, the MPI version initiates
a non-blocking receive from processor reduce_exch_proc(i), followed by an MPI
send to the same processor. After the send, the process waits until its MPI receive
completes. Thus, no overlap of communication and computation is possible.

Our tuned CAF version of NAS CG does not differ much from the MPI hand- 
coded version. In fact, we directly converted two-sided MPI communication into 
equivalent calls to notify/wait and a vectorized one-sided get communication 
event. Figure 3 shows a typical fragment of our CAF parallelization using no- 
tify/wait synchronization. Our experiments showed that for this code, replacing 
the co-array read (get) operation with a co-array write (put) had a negligible
effect on performance because of the amount of synchronization necessary to 
preserve data dependences. 

In initial experimentation with our CAF version of CG on various numbers of
processors, we found that on fewer than eight processors, performance was
significantly lower than that of its MPI counterpart. In our first CAF implementation of CG,
the receive array q was a common block variable, allocated in the static data by 
the compiler and linker. To perform the communication shown in Figure 3 our 
CAF compiler prototype allocated a temporary buffer in memory registered with 
ARMCI so that the Myrinet hardware could initiate a DMA transfer. After the 
get was performed, data was copied from the temporary buffer into the q array. 
For runs on a small number of processors, the buffers are large. Moreover, the 
registered memory pool has the starting address independent of the addresses 
of the common blocks. Using this layout of memory and a temporary commu- 
nication buffer caused the number of L3 cache misses in our CAF code to be 
up to a factor of three larger than for the corresponding MPI code, resulting in 
performance that was slower by a factor of five. Converting q (and other arrays
used in co-array expressions) to co-arrays moved their storage allocation
into the segment with co-array data (reducing the potential for conflict misses)
and avoided the need for the temporary buffer. Overall, this change greatly
reduced L3 cache misses and brought the performance of the CAF version back to
the level of the MPI code. Our lesson from this experience is that the memory layout of
communication buffers, co-arrays, and common block/save arrays might require
thorough analysis and optimization.

As Figure 4(a) shows, our CAF version of NAS CG achieves performance
comparable to that of the MPI version. The parallel efficiencies of the CAF and
MPI codes are almost indistinguishable across a range of processor counts.

6.2 NAS MG 

The MG multigrid kernel calculates an approximate solution to the discrete 
Poisson problem using four iterations of the V-cycle multigrid algorithm on 
an n × n × n grid with periodic boundary conditions [9]. The communication is
highly structured and goes through a fixed sequence of regular patterns. 

In the NAS MG benchmark, for each level of the grid, there are periodic 
updates of the border region of a three-dimensional rectangular data volume 
from neighboring processors in each of six spatial directions. Four buffers are 
used: two as receive buffers and two as send buffers. For each of the three spatial 
axes, two messages (except for the corner cases) are sent using basic MPI send 
to update the border regions on the left and right neighbors. Therefore, two 
buffers are used for each direction, one buffer to store data to be sent and the 
other to receive the data from the corresponding neighbor. Because two-sided 
communication is used, there is implicit two-way point-to-point synchronization 
between each pair of neighbors. 

The CAF version of MG mimics the MPI version. The communication buffers
used in the MPI version are replaced by co-arrays; the communication is
expressed using CAF syntax, as opposed to using MPI primitives. This approach
requires explicit synchronization. The example code is shown in Figure 2(b).
The give3 procedure performs a one-sided put to the appropriate neighbor.
Because of communication buffer reuse, two sync_notify calls are necessary to signal
our left and right neighbors that our receive buffers are ready to receive data from
them; the two following sync_wait calls ensure that the remote buffers on the left and the
right neighbors are ready for us to send data. The sync_notify following each
give3 call is matched by the neighbor’s sync_wait and signals the completion of
the put. Similarly, our sync_wait matches the neighbor’s sync_notify signaling
that the data transfer from the neighbor is complete and we can proceed to the
unpacking phase in take3.

[Figure 4 panels: (a) NAS CG (Class C) Efficiency on Itanium2 + Myrinet 2000; (b) NAS MG (Class C) Efficiency on Itanium2 + Myrinet 2000]

As the performance graph in Figure 4(b) illustrates, our CAF version of
NAS MG achieves performance comparable to that of the MPI version.

6.3 NAS SP and BT 

As described in a NASA Ames technical report [9], the NAS benchmarks BT 
and SP are two simulated CFD applications that solve systems of equations re- 
sulting from an approximately factored implicit finite-difference discretization 
of three-dimensional Navier-Stokes equations. The principal difference between 
the codes is that BT solves block-tridiagonal systems of 5x5 blocks, whereas SP 
solves scalar penta-diagonal systems resulting from full diagonalization of the 
approximately factored scheme [9]. Both consist of an initialization phase fol- 
lowed by iterative computations over time steps. In each time step, boundary 
conditions are first calculated. Then the right hand sides of the equations are 
calculated. Next, banded systems are solved in three computationally intensive 
bi-directional sweeps along each of the x, y, and z directions. Finally, flow vari- 
ables are updated. During each timestep, loosely-synchronous communication is 
required before the boundary computation, and tightly-coupled communication 
is required during the forward and backward line sweeps along each dimension. 
    lhs(1:BLOCK_SIZE, 1:BLOCK_SIZE, cc, -1, &
        0:JMAX-1, 0:KMAX-1, cr)[successor(1)] = &
      lhs(1:BLOCK_SIZE, 1:BLOCK_SIZE, cc, cell_size(1,c)-1, &
          0:JMAX-1, 0:KMAX-1, c)

    (a) NAS BT

    .... pack into out_buffer_local ....
    out_buffer(1:p, stage+1:stage+1)[successor(1)] = &
      out_buffer_local(1:p, 0:0)
    .... unpack from out_buffer ....

    (b) NAS SP

Fig. 5. Forward sweep communication in NAS BT and NAS SP



Because of the line sweeps along each of the spatial dimensions, traditional 
block distributions in one or more dimensions would not yield good parallelism. 
For this reason, SP and BT use a skewed block distribution called multipartitioning
[9, 14]. With multipartitioning, each processor handles several disjoint
blocks in the data domain. Blocks are assigned to the processors so that there is
an even distribution of work for each directional sweep, and so that each processor
has a block on which it can compute in each step of every sweep. Using
multipartitioning yields full parallelism with even load balance while requiring only
coarse-grain communication.
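
To make the property concrete, the sketch below builds one common diagonal multipartitioning for a square number of processors and checks that every slab along every axis contains exactly one tile per processor. The specific mapping is an assumption chosen for illustration and is not taken from the NPB sources; the function names are hypothetical.

```python
def diagonal_multipartition(q):
    """Map each tile (i, j, k) of a q*q*q tiling to one of q*q processors.

    Assumed construction: shift the (i, j) position diagonally by k.
    """
    return {(i, j, k): ((i + k) % q) * q + (j + k) % q
            for i in range(q) for j in range(q) for k in range(q)}

def owners_in_slab(owner, axis, index):
    """Processors owning the tiles of the slab where coordinate `axis` == index."""
    return sorted(proc for tile, proc in owner.items() if tile[axis] == index)

q = 3
owner = diagonal_multipartition(q)
# In every step of a sweep along any axis, each of the q*q processors
# owns exactly one tile of the current slab, so none sits idle.
balanced = all(owners_in_slab(owner, axis, index) == list(range(q * q))
               for axis in range(3) for index in range(q))
tiles_per_processor = len(owner) // (q * q)
```

Each processor ends up with q tiles, which is why the number of disjoint blocks per processor grows with the square root of the processor count.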

The MPI implementation of NAS BT and SP attempts to hide commu- 
nication latency by overlapping communication with computation, using non- 
blocking communication primitives. For example, in the forward sweep, except 
for the last tile, non-blocking sends are initiated in order to update the ghost 
region on the next tile. Afterwards, each process advances to the next tile it is 
responsible for, posts a non-blocking receive, performs some local computation, 
then waits for the completion of both non-blocking send and receive. The same 
pattern is present in the backward sweep. 

The CAF implementation for BT and SP inherits the multipartitioning 
scheme used by the MPI version. In BT, the main working data resides in co- 
arrays, while in SP it resides in non-shared arrays. For BT, during the boundary 
condition computation and during the forward sweep for each of the axes, no 
buffers are used for packing and unpacking, as shown in Figure 5(a). In
contrast, in SP all the communication is performed via co-array buffers (see
Figure 5(b)). This shows that when an application is written in the spirit of the
Co-array Fortran programming model, it might require fewer memory copies. In
the backward sweep, both BT and SP use auxiliary co-array buffers to
communicate data.

In our CAF implementations, we had to consider the trade-off between the 
amount of memory used for buffers and the amount of necessary synchroniza- 
tion. By using more buffer storage we were able to eliminate both output and 
anti-dependences due to buffer reuse, thus obviating the need for extra synchronization.
We used a dedicated buffer for each communication event during the
sweeps, which increased the total buffer size by a factor of the square root of the number
of processors. Experimentally, we found that this was beneficial for performance
while the memory increase was acceptable.
[Figure 6 panels: NAS BT (Class C) Efficiency on Itanium2 + Myrinet 2000; NAS SP (Class C) Efficiency on Itanium2 + Myrinet 2000]

Fig. 6. Comparison of MPI and CAF parallel efficiency for NAS SP and BT



The performance graphs in Figure 6 show that the CAF version performs
consistently better than the MPI version for BT, but is about 5% slower for SP. 
For both benchmarks, our compiler uses blocking communication primitives. By 
applying hand optimization to the code generated by our CAF compiler for 
SP, we discovered that using non-blocking communication enables us to achieve 
performance comparable to that of MPI. 

We observed that even though we used blocking communication for both BT 
and SP, we only paid a performance penalty for SP. This difference is due to 
the computation and communication characteristics of the benchmarks. Measurements
showed that BT sends half as many messages as SP, while its
communication volume is about 2/3 of that of SP. Therefore, in BT the communication
is less frequent than in SP and consists of larger messages. As a consequence,
overlapping computation with
communication is more critical for performance in SP.

6.4 Discussion 

In the course of our experimentation on our Itanium2+Myrinet2000 cluster, we 
observed that allocating co-arrays and temporary communication buffers in reg- 
istered memory provides a noticeable boost in performance. Myrinet is only able 
to perform DMA on registered (pinned) pages. If all local variables involved 
in communication are allocated in registered memory, they can be used by the 
communication library directly, without copying into temporary buffers allocated 
from a pool of registered memory. In our prototype compiler, we don’t automat- 
ically migrate local variables involved in communication into pinned memory; 
instead, we accomplished this by modifying the source code to turn them into 
co-arrays that are never referenced remotely. Automatically migrating local vari- 
ables into co-array storage can be complex to do because of the need to preserve 
sequence association among local variables in common blocks and Fortran data 
initialization statements.

Our experiments showed that using split-phase, non-blocking communication
and overlapping computation with DMA transfers significantly boosts perfor- 
mance. Our prototype compiler implements communication by an in-place con- 
version of language-level communication constructs into blocking gets and puts. 
Once we have a framework for data-flow and dependence analysis in place, we 
will be able to automatically translate blocking communication into split-phase, 
non-blocking equivalents to effectively overlap communication with computation. 
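The difference between a blocking get and a split-phase get can be illustrated with a small simulation. This is only a sketch in Python, not CAF and not the code our compiler generates: the "remote get" is modeled by a sleep, a worker thread stands in for the DMA engine, and all function names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def remote_get(latency_s):
    """Stand-in for a remote get; sleeps to model network latency."""
    time.sleep(latency_s)
    return [1.0] * 4

def local_work(duration_s):
    """Stand-in for computation that does not touch the incoming data."""
    time.sleep(duration_s)
    return 42

def blocking(latency_s, work_s):
    start = time.perf_counter()
    data = remote_get(latency_s)      # wait for the transfer to finish
    result = local_work(work_s)       # only then start computing
    return time.perf_counter() - start, data, result

def split_phase(latency_s, work_s):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=1) as pool:
        handle = pool.submit(remote_get, latency_s)  # phase 1: initiate the get
        result = local_work(work_s)                  # overlap independent computation
        data = handle.result()                       # phase 2: complete the get
    return time.perf_counter() - start, data, result

t_block, _, _ = blocking(0.05, 0.05)
t_split, _, _ = split_phase(0.05, 0.05)
print(f"blocking: {t_block:.3f} s, split-phase: {t_split:.3f} s")
```

The overlap is only legal when the computation between the two phases does not read or write the data in flight, which is what the data-flow and dependence analysis would have to establish.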

At a higher level, the original semantics of CAF as defined by Numrich and 
Reid [4, 5] require an implicit sync_memory at each procedure call boundary to 
complete any outstanding gets or puts. This requirement makes it impossible to 
overlap communication with a procedure call that does not use any of the data 
involved in communication. Requiring this implicit sync_memory for SP would 
remove a significant opportunity for latency hiding that is exploited by the MPI 
hand-coded parallelization, so we believe that this language requirement should 
be dropped. 

In the first draft CAF versions of the NAS benchmarks, we found that 
there were frequent references to read-only data that was stored off-processor. 
For example, there were frequent off-processor references to the variable 
reduce_send_start in the initial CAF version of CG, to the communication 
buffer offsets for each face in SP, and to the cell size information of neighboring 
processors in BT. We improved code performance by fetching these values once 
after they had been initialized and storing them locally. With interprocedural 
analysis to determine that these variables are essentially run-time constants, we 
could potentially apply this transformation automatically. 
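The transformation amounts to hoisting a read of a run-time constant out of the loop that uses it. A minimal sketch, assuming a toy `RemoteImage` class (hypothetical, introduced here only to count simulated off-processor reads):

```python
class RemoteImage:
    """Toy stand-in for another process image; counts remote reads."""
    def __init__(self, values):
        self._values = dict(values)
        self.remote_reads = 0

    def get(self, name):
        self.remote_reads += 1        # each call models one network round trip
        return self._values[name]

def accumulate_uncached(neighbor, steps):
    total = 0
    for _ in range(steps):
        total += neighbor.get("cell_size")   # off-processor reference every step
    return total

def accumulate_cached(neighbor, steps):
    cell_size = neighbor.get("cell_size")    # fetch once after initialization
    total = 0
    for _ in range(steps):
        total += cell_size                   # purely local reference thereafter
    return total

a = RemoteImage({"cell_size": 8})
b = RemoteImage({"cell_size": 8})
print(accumulate_uncached(a, 100), a.remote_reads)  # 800 100
print(accumulate_cached(b, 100), b.remote_reads)    # 800 1
```

The cached version is valid only because the value does not change after initialization, which is the property interprocedural analysis would need to prove.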

7 Conclusions 

This paper presents an overview of the issues that we have been grappling with 
as we work on (1) refinement of the CAF language to make it the programming 
model of choice for portable, high-performance scientific programming in Fortran 
and (2) design and implementation of a portable and retargetable CAF com- 
piler. Preliminary performance results for several NAS benchmarks on a cluster 
of workstations show that CAF is capable of achieving good performance despite 
the current lack of automatic communication and synchronization optimizations 
in our prototype CAF compiler. We were able to achieve performance compa- 
rable to highly-tuned, hand-coded MPI versions of the same benchmarks. The 
expressive syntax and explicit, one-sided communication model enabled us to 
manually perform key optimizations such as communication vectorization and 
synchronization strength reduction in the CAF source code. The CAF model 
is expressive enough to allow the user to perform these transformations man- 
ually when a compiler cannot. This is in contrast to HPF, for example, where 
it is more difficult for a user to improve code performance through source code 
adjustments. 

While we performed optimizing transformations manually on our CAF source 
code in this preliminary work, it is our intention to improve our prototype CAF
compiler to perform automatically most of the optimizing transformations
described in this paper. An advantage of writing parallel programs in CAF over
MPI is that because communication and synchronization are expressed at the 
language level, it is possible for a compiler to analyze and tailor code to whatever 
target platform is at hand. On a shared memory architecture, CAF accesses to 
remote data can simply be turned into loads and stores; performing such a rad- 
ical transformation on an MPI program would be exceedingly difficult. While 
it may be possible to annotate MPI libraries so that compilers could under- 
stand the semantics of the communication expressed by library calls, CAF offers 
a simpler, coherent model for parallel programming. 

Because CAF is amenable to automatic analysis and transformation, it is 
possible and desirable to express computation and communication in a natural 
and general way, leaving the burden of platform-specific code tuning to the com- 
piler. This is important because user-applied optimizations that perform well on 
one architecture may actually be counter-productive on a different architecture.

Abstract. We evaluate the impact of programming language features
on the performance of parallel applications on modern parallel archi- 
tectures, particularly for the demanding case of sparse integer codes. 
We compare a number of programming languages (Pthreads, OpenMP, 
MPI, UPC) on both shared and distributed-memory architectures. We 
find that language features can make parallel programs easier to write, 
but cannot hide the underlying communication costs for the target par- 
allel architecture. Powerful compiler analysis and optimization can help 
reduce software overhead, but features such as fine-grain remote accesses 
are inherently expensive on clusters. To avoid large reductions in perfor- 
mance, language features must avoid degrading the performance of local 
computations. 



1 Introduction 

Parallel computing can potentially provide huge amounts of computation power 
for solving important problems in science and engineering. However, the difficulty 
of writing parallel programs poses a major barrier to exploiting the power of 
parallel architectures. Programming is especially difficult for applications with 
irregular, fine-grain memory access patterns, since current parallel programming 
languages, tools, and architectures are evolving in directions less suited for these 
codes. Three vital goals are in conflict when choosing a parallel programming 
paradigm for clusters of shared-memory multiprocessors:

— Exploitation of maximum machine performance on a particular platform. 

— Portability of code and performance across various high performance com- 
puting platforms. 

— Programmability: easy creation of correct, reliable and efficient programs.

Parallel programming languages are designed by making different tradeoffs,
depending on assumptions of the underlying compiler, runtime system, hardware
support, target application characteristics, and acceptable user effort.

For embarrassingly parallel applications with coarse-grain communication, 
the choice of a parallel programming language is less important since almost all 
languages can achieve good performance with low programmer effort. Unfortu- 
nately, no current parallel programming paradigm is satisfactory for more com- 
plex applications with fine-grain parallelism and irregular remote accesses. MPI 
is the most portable and achieves the best performance on distributed-memory 
machines for most codes, but is difficult to program and is inefficient for ap- 
plications with many irregular fine-grained accesses. OpenMP and Pthreads are 
simple and efficient on shared-memory nodes, but do not work well (if at all) on 
clusters. HPF is portable but limited in its flexibility and applicability. Java is 
popular but does not yet have widely adopted libraries/APIs for efficient parallel
execution on clusters. 

A promising approach for easing the task of writing codes with fine-grain 
parallel accesses is to use programming languages that provide flexible remote 
accesses and support for a shared address space, such as UPC and Co-Array
Fortran. These hybrid languages simplify code development because programmers
can rely on language support for fine-grain remote accesses to get a working 
version quickly, before selectively putting effort into modifying a small subset 
of the code for enhanced performance. In comparison, programming paradigms 
such as MPI require explicit communications to be inserted throughout the code 
for correctness. 

A problem with this hybrid approach is the architectural trend towards 
building high-end supercomputers from clusters of PCs or shared-memory
multiprocessors (SMPs) using commodity parts, since this approach yields
systems with expensive, high latency inter-processor communication. As a result,
users are gravitating towards parallel programming paradigms such as MPI that 
can efficiently support coarse-grain bulk communications. Parallel programming 
paradigms such as UPC that rely on fine-grained remote accesses may find it 
difficult to achieve good performance on clusters, because the underlying archi- 
tecture does not efficiently support such operations. 

Our goal in this paper is to evaluate and quantify the performance of parallel 
language features based on experimental evaluations of a number of challeng- 
ing parallel applications, particularly those requiring fine-grain remote accesses. 
We identify programming language features that can reduce programmer effort 
and quantify the overhead encountered when using such features. We attempt to 
determine the feasibility of using a hybrid fine and coarse-grain parallel program- 
ming model on cluster architectures. We pay special attention to the performance 
of UPC because it is the first widely available commercially supported high-level 
parallel programming language that provides flexible non-local accesses for both 
shared and distributed memory paradigms. We also attempt to place our evalu- 
ation in the context of ongoing trends in parallel architectures and applications. 
More specifically, the contributions of this paper include:

1. Experimental evaluation of language features for challenging irregular
parallel applications.

2. Observations on programmability and performance for Pthreads, OpenMP, 
MPI, and UPC. 

3. Suggestions for achieving both programmability and good performance in 
the future. 

4. Predictions on impact of architectural developments on performance of par- 
allel language features. 

While our finding that fine-grained parallel applications perform poorly on
cluster architectures is not surprising, our study quantifies the performance penalty
for interesting programming languages using several challenging irregular
benchmarks.

In the remainder of the paper, we explain our choice of evaluation parame- 
ters (applications, parallel languages) and present our experimental results. We 
present our observations on programming language features and their impact on 
performance, followed by a number of suggestions for their usage in developing 
parallel applications. We conclude with a discussion of the impact of architecture 
trends and comparison with related work. 

2 Applications 

Many scientific applications have very regular memory access patterns and can 
be easily parallelized and implemented efficiently for a large number of parallel 
architectures. We chose for our evaluation three application classes that are 
more complex and represent challenging test cases for parallel programming 
paradigms. The three types of parallel applications are: 

Irregular table update. Many parallel database operations can be viewed as
making irregular parallel accesses to a large distributed table of values. If the
accesses perform associative reduction operations (e.g., summation), the
application is similar to a large histogram and may be implemented using a
coarse-grain bucket algorithm. Accesses may also perform arbitrary read-modify-write
operations, in which case fine-grain algorithms are necessary. The amount of
computation in table updates is static and may be distributed evenly at compile
time. Table update has potentially very high communication requirements.

Irregular dynamic accesses. A second class of challenging parallel applications
performs irregular parallel accesses to sparse data structures. The application may
allow a limited amount of coarse-grained accesses. The amount of computation
is static and may be distributed evenly at compile time, but the communication
requirements are very high.

Integer sort. Large in-memory sorting is a third parallel application class that is
surprisingly difficult to perform efficiently on distributed-memory parallel
architectures. Many parallel implementations are possible, including both coarse and
fine-grained algorithms. Sorting has high communication requirements.

All three types of benchmarks are characterized by irregular memory access 
to large data structures. Depending on the benchmark, both coarse and fine- 
grained remote accesses may be necessary. 

3 Programming Paradigms 

Broadly speaking, parallel paradigms can be classified as shared-memory with 
explicit threads (Pthreads, Java threads), shared-memory with task/data paral- 
lelism (OpenMP, HPF), distributed memory with explicit communication (MPI, 
SHMEM, Global Arrays), or distributed-memory with special global accesses 
(Co-Array Fortran, UPC). We describe the paradigms used in our study in more
detail. 

Pthreads (POSIX threads) is a shared-memory programming model where paral- 
lelism takes the form of parallel function invocations [9]. A parallel function body 
is executed in parallel by many threads, which can all access shared global data. 
Pthreads is the underlying implementation of parallelism for many programming 
paradigms. Java is a general purpose programming language that supports par- 
allelism in the form of threads [10]. Parallel Java programs on SMPs resemble 
Pthreads programs. Pthreads and Java are available only on SMPs. 

OpenMP is a shared-memory programming model where parallelism takes the 
form of parallel directives for loops and functions [4]. OpenMP directives specify
loops whose iterations should be executed in parallel, as well as functions that 
may be invoked in parallel. Additional directives specify data that should be 
shared or private to each thread. Compilers translate OpenMP programs into 
code that resembles Pthreads programs, where parallel loop bodies are made 
into parallel functions. OpenMP is an industry standard and is supported in 
many languages and platforms. OpenMP is currently available only on SMPs. 
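The translation of a parallel loop into a threaded function can be sketched as follows. This is an illustration in Python rather than C, using the standard `threading` module in place of Pthreads; `parallel_for` and `outlined` are hypothetical names, and a real OpenMP compiler generates considerably more elaborate code.

```python
import threading

def parallel_for(n, num_threads, body):
    """Run body(i) for i in range(n), statically partitioned across threads."""
    def outlined(thread_id):
        # The loop body has been "outlined" into a function; each thread
        # executes one contiguous chunk of the iteration space.
        chunk = (n + num_threads - 1) // num_threads
        for i in range(thread_id * chunk, min(n, (thread_id + 1) * chunk)):
            body(i)

    threads = [threading.Thread(target=outlined, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                      # implicit barrier at the end of the loop

squares = [0] * 10                    # shared data, visible to every thread

def loop_body(i):
    squares[i] = i * i                # iterations write disjoint elements

parallel_for(len(squares), 4, loop_body)
print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

No lock is needed here because each iteration writes a distinct element; loops with shared updates would require synchronization or a reduction.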

MPI (Message Passing Interface) is a distributed-memory programming model 
where threads explicitly communicate using functions in the MPI run-time
library to send and receive messages [8]. It also includes a large selection of efficient
collective communication routines. MPI is widely available (virtually every par- 
allel platform) and well tuned for performance. Despite the programming effort 
required, MPI is the current programming paradigm of choice for its portability 
and performance. 

UPC (Unified Parallel C) is a shared-memory programming model based on
a version of C extended with global pointers and data distribution
declarations for shared data [3]. Accesses via global pointers are translated into
inter-processor communication by the UPC compiler. A distinguishing feature of UPC
is that global pointers may be cast into local pointers for efficient local access.
Explicit one-way communication similar to SHMEM [?] is also supported in the 
UPC run-time library via routines such as upc_memput() and upc_memget(). 
It is the compiler’s responsibility to translate memory addresses and insert 
inter-processor communication. UPC is the first commercially supported parallel 
paradigm that supports flexible remote accesses to a shared memory abstraction. 

4 Performance Evaluation 

We believe performance is a key factor (if not the key factor) determining the 
success of parallel programming paradigms. To gain insight into the factors un- 
derlying performance, we performed an experimental performance evaluation of 
a number of programming paradigms on the following parallel platforms. 



Compaq AlphaServer SC. A 64-node cluster located at ORNL. Each node 
is an SMP with 2GB of memory, four ES-40 processors, and a single Quadrics 
network adapter. The nodes run AlphaServer 2.0 OS; the MPI implementation
is built on the native Quadrics libraries. 



Sun SunFire 6800. A 24-processor Sun shared-memory multiprocessor located 
at the University of Maryland, with UltraSparc III processors, 24GB memory, 
and crossbar interconnect running SunOS 5.8. 

4.1 Table Update 

TableUpdate performs irregular updates on a large distributed hash table. Up- 
dates are commutative and may be reordered. Several different versions of Table- 
Update are used: 

— MPI. Coarse-grain algorithm uses buckets to store updates to data on other 
processors. All buckets are synchronously exchanged between processors once 
buckets are filled. Upon receiving buckets, updates in bucket are applied to 
the local portion of the table. 

— UPC. Fine-grained algorithm uses global pointers to update non-local table 
elements. 

— UPC (bucket). Coarse-grain algorithm also uses bucketized algorithm as in 
MPI code. One-way explicit communication used to transfer buckets between 
processors. 

— C with Pthreads. Shared-memory code uses parallel function calls to update 
table elements. All threads directly access table as shared array. 

— C with OpenMP. Shared-memory code parallelizes loops computing table 
elements using OpenMP annotations. 

— Java. Shared-memory code uses Java threads to update shared global table.

Figure 1 presents the performance of TableUpdate for a table of size 2^22 on
a Compaq AlphaServer SC for MPI, UPC, and UPC (bucket). Performance is 
measured in number of table updates per millisecond per processor, and is pre- 
sented using log scale. Results show that MPI greatly outperforms UPC, though 
UPC using a coarse-grain bucket algorithm can approach the performance of 
MPI. UPC suffers significant performance degradations when using fine-grain 
access patterns because of software and hardware overhead in making point- 
wise remote accesses. 
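The gap between the fine-grain and bucketed versions comes largely from the number of messages each one sends. The following sketch counts messages for both strategies from the point of view of one process. It is a simplified single-process simulation, not the benchmark code: it assumes a block distribution of the table, additive (commutative) updates, and per-destination flushing of full buckets, and all names are illustrative.

```python
from collections import defaultdict

def owner(index, table_size, num_procs):
    """Block distribution: which process owns table entry `index`."""
    block = (table_size + num_procs - 1) // num_procs
    return index // block

def fine_grain(updates, table_size, num_procs, me=0):
    """One message per non-local update."""
    table = [0] * table_size
    messages = 0
    for index, delta in updates:
        if owner(index, table_size, num_procs) != me:
            messages += 1                     # point-wise remote access
        table[index] += delta
    return table, messages

def bucketed(updates, table_size, num_procs, bucket_capacity, me=0):
    """Collect non-local updates per destination; send a bucket when full."""
    table = [0] * table_size
    buckets = defaultdict(list)
    messages = 0

    def flush(dest):
        nonlocal messages
        if buckets[dest]:
            messages += 1                     # one bulk transfer per bucket
            for index, delta in buckets[dest]:
                table[index] += delta         # applied by the owning process
            buckets[dest].clear()

    for index, delta in updates:
        dest = owner(index, table_size, num_procs)
        if dest == me:
            table[index] += delta
        else:
            buckets[dest].append((index, delta))
            if len(buckets[dest]) == bucket_capacity:
                flush(dest)
    for dest in list(buckets):
        flush(dest)                           # send partially filled buckets
    return table, messages

demo_updates = [(i * 7 % 64, 1) for i in range(1000)]
_, fine_msgs = fine_grain(demo_updates, 64, 4)
_, bulk_msgs = bucketed(demo_updates, 64, 4, bucket_capacity=32)
print(f"fine-grain messages: {fine_msgs}, bucketed messages: {bulk_msgs}")
```

Both functions produce the same table because the updates commute; only the number of transfers differs.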

We next examine TableUpdate performance on a Sun SunFire SMP. Results 
in Figure 2 show that Java, C with Pthreads, and C with OpenMP implementa- 
tions of TableUpdate achieve comparable performance, though Java performance 
is slightly higher (possibly because it is better tuned for performance by the ven- 
dor). The SUN UPC compiler has significantly poorer performance because of 
software overhead in translating point-wise accesses to shared data. 

4.2 Conjugate Gradient 

The conjugate gradient benchmark (NAS CG benchmark) finds the principal 
eigenvalue of a sparse n x n real matrix A with random pattern of kn nonzeros 
using the inverse power method [1]. This involves solving a linear system of
the form Ap = z for different vectors z. The solver uses the conjugate gradient 
method and repeatedly calculates the sparse matrix-vector product w = Av, 
where v,w are dense vectors of length n. This benchmark is widely used and 
stresses memory and communication performance. We evaluated the following 
versions of CG: 

— MPI. This Fortran 77 version was taken from the NAS 2.3 suite, and uses 
explicit MPI communication operations. The implementation uses a (block, 
block) distribution of A, and replicates the appropriate section of v for the 
dot product with the corresponding section of A. The total size of the im- 
plementation is 1800 lines. 

— OpenMP. This is a shared-memory implementation in C with OpenMP di- 
rectives, derived from the NAS 2.3 serial code by the RWC in Japan, and 
has total size of 900 lines. This implementation uses a static partition across 
processors of the row loop of the matrix-vector product. A long-lived parallel
region is used to reduce overheads between successive sparse-matrix vector 
products. OpenMP work distribution directives are inserted for initializa- 
tions, sparse matrix-vector product, and dot products in the algorithm. 

— UPC (OpenMP). This UPC implementation was derived from the OpenMP 
shared-memory version. About 1/3 was rewritten from OpenMP, and 1/4 
was added new. The total size of this version is 1300 lines. It distributes the 
matrix A using a block-cyclic distribution with a large block size. This is 
the best distribution for this problem that can be expressed directly in UPC 
without explicitly partitioning the matrix A. Work is partitioned between 
processors in the sparse vector-matrix product according to the portions of A 
held by each processor. The vector v is replicated to reduce communication;
the default strategy of distributing the shared vector leads to run times that
are two orders of magnitude larger due to the repeated fine-grain random 
accesses to v in the sparse matrix-vector product. 

— UPC (MPI). This UPC implementation more closely follows the MPI algo- 
rithm. It uses an explicit (blocked,*) distribution of A and replicates the 
vector v. Coarse-grain data movement (e.g., upc_memget(), upc_memput()) 
is used to replicate the result w. The total size of this version is 1600 lines. 
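The kernel shared by all of these versions is the sparse matrix-vector product w = Av. A minimal sketch follows, assuming A is stored in compressed sparse row (CSR) form and partitioned by blocks of rows, with v replicated on every simulated process. The storage format and function names are assumptions made for illustration; they are not taken from the NAS sources.

```python
def csr_matvec_rows(row_ptr, col_idx, values, v, row_begin, row_end):
    """Compute w[row_begin:row_end] for w = A v, with A in CSR form."""
    w_part = []
    for row in range(row_begin, row_end):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * v[col_idx[k]]   # irregular, data-dependent access to v
        w_part.append(acc)
    return w_part

def partitioned_matvec(row_ptr, col_idx, values, v, num_procs):
    """Each simulated process owns a block of rows and a full copy of v."""
    n = len(row_ptr) - 1
    block = (n + num_procs - 1) // num_procs
    w = []
    for p in range(num_procs):
        begin, end = min(n, p * block), min(n, (p + 1) * block)
        local_v = list(v)                      # replication of v on process p
        w.extend(csr_matvec_rows(row_ptr, col_idx, values, local_v, begin, end))
    return w

# 3 x 3 example: A = [[2, 0, 1], [0, 3, 0], [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
print(partitioned_matvec(row_ptr, col_idx, values, [1.0, 2.0, 3.0], 2))  # [5.0, 6.0, 19.0]
```

The inner loop reads `v[col_idx[k]]` at positions determined by the sparsity pattern. If v were distributed rather than replicated, each such read could become a fine-grain remote access.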

Figure 3 presents our results for a class B problem size for CG on the Al- 
phaServer SC. Results are reported in MFLOPS per processor. The total number 
of FLOPS required is defined by the problem size. OpenMP results are only avail- 
able up to the 4 processors on each node and scale relatively poorly due to the 
replication of v into the processor caches through misses on v in random order. 
MPI outperforms both versions of UPC, though the UPC (MPI) implementation 
is closer in performance. 

The sequential performance of the UPC implementations is 50-60% of the sin- 
gle processor MPI and OpenMP performance. The MPI implementation achieves 
a speedup of 10.4 with 16 processors, and a speedup of 17.6 with 32 processors. 
The UPC (OpenMP) speedup is 4.0 with 16 processors, and 5.0 with 32 proces- 
sors, hence performs at only 28% of the MPI implementation at 32 processors. 
The UPC (MPI) speedup is better at 7.0 with 16 processors, and 9.1 with 32 
processors, hence performs at 52% of the MPI implementation at 32 processors. 
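The percentages quoted above are consistent with taking the ratio of each UPC speedup to the MPI speedup at 32 processors. The short check below reproduces that arithmetic from the reported speedups; it assumes this is how the percentages were derived.

```python
def relative_speedup(speedup, baseline_speedup):
    """Speedup of one implementation as a fraction of a baseline speedup."""
    return speedup / baseline_speedup

# Speedups at 32 processors, as reported in the text.
mpi_32, upc_openmp_32, upc_mpi_32 = 17.6, 5.0, 9.1

print(round(100 * relative_speedup(upc_openmp_32, mpi_32)))  # 28
print(round(100 * relative_speedup(upc_mpi_32, mpi_32)))     # 52
```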

The performance of CG is heavily dependent on memory system performance. 
For comparison, a vectorized implementation of the CG benchmark achieves 
about 1,500 MFLOPS on a single processor of an NEC SX-6, and about 1,100 
MFLOPS per processor using all eight processors of an SX-6 node. 

4.3 Integer Sort 

Integer sort performs a parallel radix sort of a large collection of integer data. We 
timed MPI and UPC implementations on an AlphaServer SC. Both implemen- 
tations used coarse-grain parallel algorithms employing bulk explicit messages, 
since a fine-grain UPC implementation was found to be intolerably inefficient. 
A 128K key input data size is used. Performance is reported as efficiency. Re- 
sults in Figure 4 show that MPI outperforms UPC slightly, with the difference 
increasing for larger numbers of processors. 
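A coarse-grain parallel radix sort of the kind described can be sketched as follows. This is a sequential simulation of the processes, written for illustration only; it is not the benchmark implementation. It assumes non-negative integer keys below 2**key_bits, and the function and parameter names are hypothetical.

```python
def parallel_radix_sort(keys, num_procs, bits_per_pass=4, key_bits=16):
    """LSD radix sort over simulated processes with bulk key exchange."""
    radix = 1 << bits_per_pass
    # Initial block distribution of the keys.
    block = (len(keys) + num_procs - 1) // num_procs
    local = [keys[p * block:(p + 1) * block] for p in range(num_procs)]

    for shift in range(0, key_bits, bits_per_pass):
        # Phase 1: each process buckets its own keys by the current digit.
        buckets = [[[] for _ in range(radix)] for _ in range(num_procs)]
        for p in range(num_procs):
            for key in local[p]:
                buckets[p][(key >> shift) & (radix - 1)].append(key)
        # Phase 2: bulk exchange. Digit d is sent to process d * num_procs // radix
        # as one message per (source, digit) pair instead of one per key.
        incoming = [[] for _ in range(num_procs)]
        for digit in range(radix):
            dest = digit * num_procs // radix
            for p in range(num_procs):         # source order keeps each pass stable
                incoming[dest].extend(buckets[p][digit])
        local = incoming

    return [key for part in local for key in part]

print(parallel_radix_sort([170, 45, 75, 90, 802, 24, 2, 66], 4))
# [2, 24, 45, 66, 75, 90, 170, 802]
```

Each pass moves whole buckets between processes, so the number of messages depends on the radix and the number of processes rather than on the number of keys.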

4.4 UPC Microbenchmark 

Our experimental results for entire applications showed that fine-grain algo- 
rithms were exceedingly inefficient for cluster architectures. We repeated our 
experiments using the Berkeley UPC compiler [2] on the AMD Athlon PC clus- 
ter at Ohio, and UPC performance was only slightly improved relative to MPI. 

The problem, we believe, is caused by the overhead of fine-grained accesses 
in UPC. UPC provides global shared pointers that can easily access non-local
data, providing a convenient shared-memory abstraction for parallel
programming. Though a shared data element can be accessed in a completely transparent
fashion by any process executing on any processor, the overhead of direct point- 
wise access can be quite significant. To quantify both the hardware and software 
overheads in greater detail, we used UPC microbenchmarks to evaluate perfor- 
mance on a wide range of parallel architectures. 

— Compaq AlphaServer SC system (Falcon) at Oak Ridge National Laboratory, 
running Version 1.7 of the Compaq UPC compiler. 

— Single node AlphaServer Marvel at University of Florida, running Version 
2.1 of the Compaq UPC compiler. 1 

— AMD Athlon Cluster (64 dual-processor nodes) with Myrinet interconnect 
at the Ohio Supercomputer Center, running the Berkeley UPC compiler. 

— SUN SunFire 6800 system (24-nodes) at the University of Maryland, running 
the Sun UPC compiler. 

— Cray T3E system at Michigan Tech University, running the original UPC 
compiler. 

— SGI Origin 2000 at University of North Carolina, running the Intrepid UPC 
compiler. 

We measured the cost of direct point-wise shared data accesses, using
both private and shared pointers. Figure 5 shows the per-word access cost using 
a read-modify-write (increment-by-one) operation on floating-point doubles, for 
various modes of access: 

— Private: local shared data that is accessed as private data via casting UPC 
pointer to private. 

— Shared-local: local shared data that is accessed directly using a UPC shared
pointer.

— Shared-same-node: non-local shared data that is local to another process on 
the same SMP node. 

— Shared-remote: non-local shared data that is on a different node. 

It can be observed that on all systems, there is a significant difference in 
the access time for private data and shared-local data, even though there is no 
data movement involved with the latter. The difference represents the overhead 
of translating a shared UPC reference into a node-address pair. This overhead 
was over 500 times a local memory access cost on the Compaq AlphaServer with 
the earlier version (v1.7) of the Compaq UPC compiler. Compiler enhancements
have reduced the overhead in later versions (v2.1) of the compiler to around 100 
times the private data access cost. 
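The kind of address arithmetic behind this overhead can be illustrated with the block-cyclic layout that UPC uses for shared arrays. The sketch below maps a global index to an owning thread and a local offset. It shows only the layout rule; the pointer representation and translation code used by the compilers measured here are not described in the text, so the function names and structure are assumptions.

```python
def shared_to_local(index, block_size, num_threads):
    """Map a global index of a block-cyclic shared array to
    (owning thread, offset within that thread's local storage)."""
    block_number = index // block_size
    thread = block_number % num_threads        # blocks are dealt out round-robin
    local_block = block_number // num_threads  # how many blocks precede it locally
    phase = index % block_size                 # position inside the block
    return thread, local_block * block_size + phase

def shared_read(local_parts, index, block_size):
    """Every access pays for the translation, even when the data is local."""
    thread, offset = shared_to_local(index, block_size, len(local_parts))
    return local_parts[thread][offset]

print(shared_to_local(7, 2, 3))  # (0, 3)
```

A private pointer skips this computation entirely, which is why casting a shared pointer to a private one pays off when the data is known to be local.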

Another area where compiler optimization can reduce software overhead for
memory access costs is in accessing non-local data located on the same node
(belonging to another thread on the same node). More powerful compiler
optimizations can use more efficient local memory accesses in this situation, as
demonstrated by the newer Compaq UPC compiler (v2.1). Nonetheless, even
with both optimizations (for local shared data and same-node shared data),
memory access costs are still two orders of magnitude higher than access to
private memory for UPC on the AlphaServer Marvel system. Fine-grain non-local
accesses must therefore be used sparingly if at all in performance critical sections
of a parallel UPC program.

1 The authors would like to thank Dr. Alan George at the University of
Florida-Gainesville for providing access to this machine.



4.5 Evaluation Summary 

Summarizing our results, we find that on SMPs, thread-based paradigms are
closest to the underlying hardware and provide the best performance. On clusters,
paradigms with explicit communication have the lowest overhead and achieve
the best performance. UPC programs can achieve good performance when
written in a similar coarse-grain style using bulk communication routines; otherwise
performance can be extremely poor.



5 Language Features 

Based on our experimental evaluation, we present some observations and
suggestions on high-level language features. A number of parallel programming
languages provide features that give the illusion of shared memory.
The UPC programming model provides access to cyclically distributed
shared arrays through global pointers, though when accessing only local
portions of a shared array, global pointers may be cast back into local pointers
for greater efficiency. In addition, the UPC run-time library also provides
one-way, coarse-grained explicit communication primitives through functions such
as upc_memget() and upc_memput(). We make the following observations about
these language features:



A global shared memory programming model is easy to use. At the
core of the UPC programming model is the ability to easily access non-local
data in a parallel program simply through global pointers. Programmers need 
only specify data that is to be distributed across processors, and reference them 
through special global pointers. The fine-grained UPC programming model is 
very simple and easy to use. The resulting code is cleaner and more maintainable 
than paradigms such as MPI that require explicit communication in the program. 



User level shared memory is not a good reflection of clusters. While the 
programming model may allow easy fine-grain access to non-local data, this is not 
supported by the underlying hardware architecture. The interconnect between 
nodes of a cluster typically provides high bandwidth but also long latencies, 
making aggregate coarse-grained communication much more efficient than many 
fine-grained remote accesses. This problem will only worsen as future parallel 
architectures continue to evolve towards clusters of SMPs. In comparison, the 
coarse-grain one-way communication primitives in many languages more accu- 
rately reflect the actual communication mechanisms supported by the hardware. 
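A simple linear cost model makes the latency argument concrete. The sketch below charges every message a fixed latency plus a size-dependent term. The latency and bandwidth values are hypothetical, chosen only for illustration, and are not measurements from any of the systems in this study.

```python
def transfer_time(num_messages, bytes_per_message, latency_s, bandwidth_bytes_per_s):
    """Linear cost model: each message pays latency plus size / bandwidth."""
    return num_messages * (latency_s + bytes_per_message / bandwidth_bytes_per_s)

# Hypothetical interconnect parameters (assumptions, not measured values).
LATENCY = 10e-6          # 10 microseconds per message
BANDWIDTH = 200e6        # 200 MB/s

words = 100_000
fine = transfer_time(words, 8, LATENCY, BANDWIDTH)        # one 8-byte word per message
bulk = transfer_time(1, 8 * words, LATENCY, BANDWIDTH)    # one aggregated message
print(f"fine-grain: {fine:.4f} s, bulk: {bulk:.4f} s, ratio: {fine / bulk:.0f}x")
```

Under this model the fine-grain transfer is dominated by the per-message latency, while the bulk transfer is dominated by the bandwidth term, so aggregation helps most when latency is large relative to the time needed to move one word.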



A shared-memory programming model can encourage poor perfor- 
mance on clusters. Because the fine-grained shared-memory programming 
model is so seductive, one can argue that it actually leads to poor performance 
by encouraging programmers to write fine-grain codes that execute poorly on 
clusters. Programmers can code around this problem, but usually only at the 
cost of complicating the programming model or changing their coarse-grain
algorithm.

We are dubious that compiler techniques will solve this problem. Given
the lack of hardware support for efficient fine-grain communication on clusters, 
we believe programmers will need to develop parallel algorithms with coarse- 
grain block data movement to achieve good performance. Compilers can remove 
some of the inefficiencies of fine-grain communication, but cannot robustly trans- 
form fine-grain parallel algorithms into efficient block parallel codes for clusters. 



The (hybrid) programming model can combine fine-grain and coarse- 
grain accesses. One advantage of the UPC programming model is that it 
allows integration of fine-grain remote accesses with global pointers and coarse- 
grain explicit communication using library routines such as upc_memput() and 
upc_memget(). As we stated previously, a hybrid programming paradigm such 
as UPC can ease the development and maintenance of parallel codes. Most of 
the program may be written cleanly using global pointers, inserting explicit 
coarse-grain communication only for performance critical sections. Our experi- 
mental evaluation shows that when done well, the resulting codes can achieve 
performance close to MPI on clusters. However, programmers must be extremely 
careful because the cost of using global pointers for remote accesses is so high. 
Developing coarse-grain parallel algorithms for performance-critical sections of 
the program may also require extensive modifications to the algorithms and data 
structures used in the code. 
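
The gap between fine-grain and bulk transfers can be illustrated with a simple
latency/bandwidth cost model. The sketch below is only an illustration: the
function names and the latency and bandwidth figures are hypothetical
placeholders, not values measured in our experiments.

```python
def fine_grain_cost(n_elems, elem_bytes, latency_s, bandwidth_bps):
    """Time to move n_elems with one message per element (hypothetical model)."""
    return n_elems * (latency_s + elem_bytes / bandwidth_bps)


def bulk_cost(n_elems, elem_bytes, latency_s, bandwidth_bps):
    """Time to move the same data in a single block transfer."""
    return latency_s + n_elems * elem_bytes / bandwidth_bps


# Assumed, illustrative parameters: 10 microseconds latency, 100 MB/s.
LATENCY = 10e-6
BANDWIDTH = 100e6

fine = fine_grain_cost(1000, 8, LATENCY, BANDWIDTH)
bulk = bulk_cost(1000, 8, LATENCY, BANDWIDTH)
ratio = fine / bulk   # how much slower the element-wise version is
```

Under this model the per-message latency dominates the fine-grain version,
which is why bulk routines pay off in performance-critical sections.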

Programming language features must avoid degrading local computa- 
tions. Many computations in parallel programs can be performed on purely local 
or previously prefetched remote data. Parallel programming languages should be 
designed so that these local computations can be compiled (and optimized) by 
the native sequential compiler. Otherwise performance can degrade, sometimes 
significantly. A great deal of the success of MPI can be attributed to following 
this rule, since all computations depend only on local data after calls to MPI 
communications functions return. In comparison, UPC requires user-inserted
explicit copies of remote global data to local buffers (or casting global
pointers to local if shared data is already local) to avoid excessive overhead.
Simply accessing global shared data is too expensive, even though the global
data may be completely located locally. For instance, accessing local data
using a global pointer in UPC can result in a slowdown of over 100 times, even
though the data is actually local. 

5.1 Advice on Choosing Parallel Paradigms 

We summarize our observations on the parallel language features as follows. Even 
though a language like UPC may support a fine-grain programming model, it 
can achieve respectable performance on clusters only if fine-grain remote accesses 
are used sparingly. Coarse-grain parallel algorithms and bulk communication are 
still essential for achieving good performance. For fine-grain parallel algorithms, 
even though language and compiler support can improve performance compared 
to naive implementations, absolute performance on clusters is likely so poor that 
differences will be insignificant. 

Based on our experiences, we believe that the prime factor in choosing par- 
allel paradigms is the nature of the algorithm. For coarse-grain parallel algo- 
rithms on clusters, many choices are possible. For peak performance, explicit 
message passing paradigms such as MPI and SHMEM will likely provide the 
best performance. If program development time is an issue, choosing a hybrid 
UPC implementation and selectively using bulk and collective communication 
such as upc_memget() and upc_memput() routines in computationally intensive 
portions of the program can be useful. Programming effort can also be reduced 
by exploiting existing libraries where possible. For fine-grain parallel algorithms, 
there are fewer options. Implementations on clusters using only fine-grain lan- 
guage features are likely to be extremely slow. If the data size is small, these 
codes may be executed on SMPs. Otherwise coarse-grain alternatives should be 
developed if possible. 

On the Cray T3E (the original platform for UPC), UPC appears to be an 
unqualified success and one of the best possible choices for a programming lan- 
guage/paradigm. However, the suitability of fine-grain programming languages 
for cluster environments, with higher latencies and message overheads, is unclear. 
Obtaining good performance from a shared memory in a cluster environment 
requires programming in specific and sometimes convoluted styles, discarding 
many of the ease-of-use features of the language. Advancing compiler technology
can help in some cases, but still results in an environment with a complicated 
and opaque performance model. The ability of a programmer to write a compli- 
cated fine-grain parallel program and have confidence that it will achieve good 
performance across a range of platforms still seems a distant dream. 

6 Impact of Trends in Parallel Architectures 

We also wish to evaluate parallel language features in the context of ongoing 
architectural developments. Here we examine developments and trends in parallel 
computer architectures and their impact on parallel programming paradigms. 



Faster interconnects. High-speed cluster interconnects continue to improve 
in bandwidth and latency. Both proprietary interconnects (e.g., Quadrics Elan 
used in Compaq AlphaServer) and systems for connecting commodity proces- 
sors (e.g., SCI, Dolphin, Myrinet, VIA, InfiniBand) are improving in perfor- 
mance. Such interconnects also offer better support for shared memory, small 
messages, and one-sided communication, and thus may improve fine-grain com- 
munication performance. On the other hand, while the absolute performance of 
inter-processor communication is steadily improving, the cost of communication 
relative to computation continues to increase due to ever faster nodes and
processors. We see no technological developments that will narrow this gap, or
even slow its growth, in the near future. 

Larger memories. Although memory latency is increasing relative to processor 
speeds, memory size is increasing due to greater chip densities. As memory prices 
continue to drop, it is becoming possible to construct parallel systems with much 
larger amounts of memory than in the past. Cluster and MPP systems can now 
be built with several Terabytes of memory, and even SMPs can be purchased with 
256 Gigabytes or more of memory. Continuing increases in SMP memory size 
may allow them to run (commercial) applications previously limited to MPPs 
and clusters, reducing the demand and vendor support for more complicated 
programming models. 

Processor/memory integration. Processor-in-memory (PIM) designs can 
potentially offer enormous improvements for specific problems by providing ef- 
ficient parallel operations on data. However, they do not obviate the need for 
inter-processor communication. Hence the general utility of such designs will 
still depend on communication performance. Specific aspects of PIM designs 
may start to appear in memory controllers for conventional systems, but are 
probably still a few years away. In general, PIM-like systems will likely increase 
the cost of non-local memory accesses relative to computation, increasing rather 
than reducing the difficulty of efficient parallel programming. 

Multithreading. Microprocessor design seems to be heading towards greater 
support for multithreading to tolerate increasing memory latencies. Increasing 
levels of task-level multithreading will start to make even single processor nodes 
on MPP systems resemble SMPs, and likely accelerate the shift into hybrid 
programming models suitable for cluster architectures. 

The good news is that as parallel architectures improve, programs will be 
able to process larger irregular problems more quickly. The bad news is that the 
efficiency of parallel programs will continue to decrease. 



7 Related Work 

Obviously there is a tremendous amount of research on parallel language design 
and benchmarking. The most relevant to this paper is the recent work analyzing 
the performance of UPC. El-Ghazawi et al. have been developing and bench- 
marking UPC codes [5, 6, 7] and have discovered performance can be respectable, 
if a coarse-grain programming style is adopted. Yelick et al. have actually
developed their own UPC translator/compiler [2]. Their experiments show similar
results, that fine-grain accesses are significantly more expensive, and performance 
improves if the compiler can aggregate remote accesses to reduce costs. In com- 
parison, we study a wider range of parallel languages on a slightly different set 
of applications. Pugh and Spacco use similar benchmarks to evaluate MPJava, 
a method for developing high-performance parallel computations in Java [11]. 



8 Conclusions 

In this paper, we evaluated features from a number of parallel programming lan- 
guages (MPI, UPC, OpenMP, Java, C/Pthreads) for their performance and ease 
of use. We find that languages such as UPC that support a shared memory and 
flexible non-local accesses can reduce the difficulty of parallel programming.
Unfortunately, parallel applications requiring fine-grain accesses still achieve
poor performance on clusters because of the amount of inherent software and
hardware overhead, regardless of the programming paradigm or language feature used. 
Language support for fine-grain non-local accesses can still prove useful, by re- 
ducing the difficulty of parallel programming. Decent performance is achievable 
by using coarse-grain bulk communication in performance-critical sections of the 
code. 


Abstract. We seek to extend the scope and efficiency of iterative compilation
techniques by searching not only for program transformation 
parameters but for the most appropriate transformations themselves. 
For that purpose, we need a generic way to express program transforma- 
tions and compositions of transformations. In this article, we introduce 
a framework for the polyhedral representation of a wide range of transfor- 
mations in a unified way. We also show that it is possible to generate ef- 
ficient code after the application of polyhedral program transformations. 
Finally, we demonstrate an implementation of the polyhedral represen- 
tation and code generation techniques in the Open64/ORC compiler. 



1 Introduction 

Optimizing and parallelizing compilers face a tough challenge. Due to their im- 
pact on productivity and portability, programmers of high-performance applica- 
tions want compilers to automatically produce quality code on a wide range 
of architectures. Simultaneously, Moore’s law indirectly urges the architects 
to build complex architectures with deeper pipelines and (non-uniform) memory
hierarchies, wider general-purpose and embedded cores with clustered units 
and speculative structures. Static cost models have a hard time coping with 
rapidly increasing architecture complexity. Recent research works on iterative 
and feedback-directed optimizations [17] suggest that practical approaches based 
on dynamic information can better harness complex architectures. 

Current approaches to iterative optimizations usually choose a rather small 
set of program transformations, e.g., cache tiling and array padding, and fo- 
cus on finding the best possible transformation parameters, e.g., tile size and 
padding size [17] using parameter search space techniques. However, a recent 
comparative study of model-based vs. empirical optimizations [22] stresses that 
many motivations for iterative, feedback-directed or dynamic optimizations are 
irrelevant when the proper transformations are not available. We want to extend 
the scope and efficiency of iterative compilation techniques by making the pro- 
gram transformation itself one of the parameters. Moreover, we want to search 
for compositions of program transformations and not only single program
transformations. For that purpose, we need a generic method for expressing
program transformations and compositions of those. 

This article introduces a unified framework for the implementation and
composition of generic program transformations. This framework relies on a polyhedral 
representation of loops and loop transformations. By separating the iteration do- 
mains from the statement and iteration schedules, and by enabling per-statement 
transformations, this representation avoids many of the limitations of iteration- 
based program transformations, widens the set of possible transformations and 
enables parameterization. Few invariants constrain the search space and our 
non-syntactic representation imposes no ordering and compatibility constraints. 
In addition, statements are named independently from their location and
surrounding control structures: this greatly simplifies the practical description
of transformation sequences. We believe this generic expression is appropriate for 
systematic search space techniques. 

The corresponding search techniques and performance evaluations are out of 
the scope of this work and will be investigated in a follow-up article. This work 
presents the principles of our unified framework and the first part of its imple- 
mentation. Also, since polyhedral transformation techniques can better accom- 
modate complex control structures than traditional loop-based transformations, 
we start with an empirical study of control structures within a set of bench- 
marks. The four key aspects of our research work are: (1) empirically evaluating 
the scope of polyhedral program transformations, (2) defining a practical trans- 
formation environment based on a polyhedral representation, (3) showing that 
it is possible to generate efficient code from a polyhedral transformation, (4) 
implementing the polyhedral representation and code generation technique in 
a real compiler, Open64/ORC [18], with applications to real benchmarks. 

Finally, our framework operates at an abstract semantic level to hide the 
details of control structures, rather than on a syntax tree. It allows per-statement 
and extended transformations that make few assumptions about control struc- 
tures and loop bounds. Consequently, while our framework is initially geared 
toward iterative optimization techniques, it can also facilitate the implemen- 
tation of statically driven program transformations in a traditional optimizing 
compiler. 

The paper is organized as follows. We present the empirical analysis of static 
control structures in Section 2 and discuss their significance in typical bench- 
marks. The unified transformation model is described in Section 3. Section 4 
presents the code generation techniques used after polyhedral transformations. 
Finally, implementation in Open64/ORC is described in Section 5. 



2 Static Control Parts 

Let us start with some related works. Since we did not directly contribute to 
the driving of optimizations and parallelization techniques, we will not compare 
with the vast literature in the field of model-based and empirical optimization. 

Well-known loop restructuring compilers proposed unified models and
intermediate representations for loop transformations, but none of them
addressed the general composition and parameterization problem of polyhedral
techniques. ParaScope [6] is both a dependence-based framework and an
interactive source-to-source compiler for Fortran; it implements classical loop
transformations. SUIF [11] was designed as an intermediate representation and
framework for automatic loop restructuring; it quickly became a standard
platform for implementing virtually any optimization prototype, with multiple
front-ends, machine-dependent back-ends and variants. Polaris [4] is an
automatic parallelizing compiler for Fortran; it features a rich sequence of
analyses and loop transformations applicable to real benchmarks. These three
projects are based on a syntax-tree representation and ad-hoc dependence
models, and implement polynomial algorithms. PIPS [12] is probably the most
complete loop restructuring compiler, implementing polyhedral analyses and
transformations (including affine scheduling) and interprocedural analyses
(array regions, alias). PIPS uses an expressive intermediate representation,
a syntax-tree with polyhedral annotations. 

Within the Omega project [14], the Petit dependence analyzer and loop re- 
structuring tool [13] is much closer to our work: it provides a unified polyhedral 
framework (space-time mappings) for iteration reordering only, and it shares our 
emphasis on per-statement transformations. It is intended as a research tool for 
small kernels only. The MARS compiler [16] is also very close to our work: its 
polyhedral representation unifies several loop transformations to ease the
application of long transformation sequences. Its successes in iterative
optimization [17] make it the main comparison point and motivation for our
work, although MARS lacks the expressivity of the affine schedules we use in
our unified model. 

Two codesign projects have a lot in common with our semi-automatic op- 
timization project. MMAlpha [10] is a domain-specific single assignment lan- 
guage for systolic array computations, a polyhedral transformation framework, 
and a high-level circuit synthesis tool. The interactive and semi-automatic
approach to polyhedral transformations was introduced by MMAlpha. The PICO 
project [20] is a more pragmatic approach to codesign, restricting the applica- 
tion domain to loop nests with uniform dependences and aiming at the selection 
and coordination of existing functional units to generate an application-specific 
VLIW processor. Both tools only target small kernels. 

2.1 Decomposition into Static Control Parts 

In the following, loops are normalized and split into two categories: loops
from 0 to some bound expression with an integer stride, called do loops; and
other kinds of loops, referred to as while loops. Early phases of the Open64
compiler perform most of this normalization, along with closed form
substitution of induction variables. Notice that some Fortran and C while loops
may be normalized to do loops when bound and stride can be discovered statically. 
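
As an illustration of what such a normalization involves, the sketch below
rewrites a counted loop `do i = lower, upper, stride` into a zero-based
counter. It is one possible normalization (to a unit-stride counter), written
for this example only; the function name is hypothetical and the code does not
reproduce what Open64 actually does.

```python
def normalize_do_loop(lower, upper, stride):
    """Return (trip_count, recover) for `do i = lower, upper, stride`.

    The normalized loop runs t = 0 .. trip_count - 1 and the original
    counter is recovered as i = lower + stride * t.
    This sketch assumes a positive stride.
    """
    if stride <= 0:
        raise ValueError("this sketch only handles positive strides")
    trip_count = max(0, (upper - lower) // stride + 1)
    return trip_count, (lambda t: lower + stride * t)


trip, recover = normalize_do_loop(3, 11, 2)   # do k = 3, 11, 2
values = [recover(t) for t in range(trip)]    # original counter values
```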

The following definition is a slight extension of static control nests [8].
Within a function body, a static control part (SCoP) is a maximal set of
consecutive statements without while loops, where loop bounds and conditionals
may only depend on invariants within this set of statements. These invariants
include symbolic constants, formal function parameters and surrounding loop
counters: they are called the global parameters of the SCoP, as well as any
invariant appearing in some array subscript within the SCoP. A static control
part is called rich when it holds at least one non-empty loop; rich SCoPs are
the natural candidates for polyhedral loop transformations. An example is shown
in Figure 1. We will only consider rich SCoPs in the following.

                      SCoP decomposition

    S1                            SCoP 1, one statement, non rich

    do j=1, i*i
      S2                          SCoP 2, three statements, rich
      do k=0, j                     parameters: i,j
        if (j .ge. 2) then          iterators: k
          S3
        S4

    do p = 0, 6                   SCoP 3, two statements, rich
      S5                            iterators: p
    S6

Fig. 1. Example of decomposition into static control parts 
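
The definition can be made concrete with a small sketch that splits a toy
syntax tree into maximal runs of consecutive static nodes. This is not the
extraction algorithm of [2]: the `Stmt` and `Loop` classes, the `static` flag
(supplied by hand here rather than computed from bounds), and the nesting
assumed for the example program are all hypothetical, and conditionals are
ignored.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Stmt:
    name: str


@dataclass
class Loop:
    static: bool                       # True for a do loop with static control
    body: List["Node"] = field(default_factory=list)


Node = Union[Stmt, Loop]


def is_static(node):
    """A node is static if it is a statement or a static loop of static nodes."""
    if isinstance(node, Stmt):
        return True
    return node.static and all(is_static(n) for n in node.body)


def count_stmts(node):
    """Number of statements held by a node."""
    if isinstance(node, Stmt):
        return 1
    return sum(count_stmts(n) for n in node.body)


def decompose(body):
    """Return the SCoPs of `body` as lists of consecutive static nodes."""
    scops, current = [], []
    for node in body:
        if is_static(node):
            current.append(node)
            continue
        if current:                    # a non-static loop ends the current SCoP
            scops.append(current)
            current = []
        scops.extend(decompose(node.body))   # look inside the non-static loop
    if current:
        scops.append(current)
    return scops


def is_rich(scop):
    """Rich SCoPs hold at least one non-empty loop."""
    return any(isinstance(n, Loop) and n.body for n in scop)


# Toy program loosely modeled on Figure 1 (nesting is an assumption).
program = [
    Stmt("S1"),
    Loop(static=False, body=[          # do j = 1, i*i, assumed non-static here
        Stmt("S2"),
        Loop(static=True, body=[Stmt("S3"), Stmt("S4")]),   # do k = 0, j
    ]),
    Loop(static=True, body=[Stmt("S5")]),                   # do p = 0, 6
    Stmt("S6"),
]
scops = decompose(program)
sizes = [sum(count_stmts(n) for n in scop) for scop in scops]
rich = [is_rich(scop) for scop in scops]
```

With this encoding the three SCoPs hold one, three and two statements, and only
the first one is not rich.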

As such, a SCoP may hold arbitrary memory accesses and function calls; 
a SCoP is thus larger than a static control loop nest [8]. Interprocedural alias 
and array region analysis would be useful for precise dependence analysis.
Nevertheless, our semi-automatic framework copes with crude dependence
information by allowing the expert user to override static analysis when
applying transformations. 

2.2 Automatic Discovery of SCoPs 

SCoP extraction is greatly simplified when implemented within a modern com- 
piler infrastructure such as Open64/ORC. Previous phases include function in- 
lining, constant propagation, loop normalization, integer comparison normal- 
ization, dead-code and goto elimination, and induction variable substitution, 
along with language-specific preprocessing: pointer arithmetic is replaced by ar- 
rays, pointer analysis information is available (but not yet used in our tool), etc. 
The algorithm for SCoP extraction is detailed in [2]; it outputs a list of SCoPs 
associated with any function in the syntax tree. Our implementation in Open64 
is discussed in Section 5. 

2.3 Significance Within Real Applications 

Thanks to an implementation of the previous algorithm in Open64, we studied 
the applicability of our polyhedral framework to several benchmarks. 

Fig. 2. Coverage of static control parts in high-performance applications 



Figure 2 summarizes the results for the SpecFP 2000 and PerfectClub benchmarks
handled by our tool (single-file programs only, for the time being).
Construction of the polyhedral representation takes much less time than the
preliminary analyses performed by Open64/ORC. All codes are in Fortran77, except
art and quake in C, and lucas in Fortran90. The first column shows the number of
functions (inlining was not applied in these experiments). The next two columns 
count the number of SCoPs with at least one global parameter and enclosing at 
least one conditional, respectively; the first one advocates for parametric analy- 
sis and transformation techniques; the second one shows the need for techniques 
that handle static-control conditionals. The next two columns in the “State- 
ments” section show that SCoPs cover a large majority of statements (many 
statements are enclosed in affine loops). The last two columns in the “Array 
References” section are very promising for dependence analysis: most subscripts 
are affine except for lucas and mg3d (the rate is over 99% in 7 benchmarks), but 
approximate array dependence analyses will be required for a good coverage of 
the 5 others. In accordance with earlier results using Polaris [7], the coverage 
of regular loop nests is strongly influenced by the quality of loop normalization 
and induction variable detection. 

Our tool also gathers detailed statistics about the number of parameters and 
statements per SCoP, and about statement depth (within a SCoP, not counting 
non-static enclosing loops). Figure 3 shows that almost all SCoPs are smaller 
than 100 statements, with a few exceptions, and that loop depth is rarely greater 
than 3. Moreover, deep loops also tend to be very small, except for applu, adm 
and mg3d which contain depth-3 loop nests with tens of statements. This 
means that most polyhedral analysis and transformations will succeed and re- 
quire reasonable resources. It also gives an estimate of the scalability required 
for worst-case exponential algorithms, like the code generation phase to convert 
the polyhedral representation back to source code. 

[Figure 3: histograms with panels "SpecFP: Statement Distribution", "SpecFP:
Statement Depth" and "PerfectClub: Statement Distribution"; axes: Statement
Range, Statement Depth]

Fig. 3. Distribution of statement depths and SCoP size 



3 Unified Polyhedral Representation 

In this section, we define the principles of polyhedral program transformations.
The term polyhedron will be used in a broad sense to denote a convex set of
points in a lattice (also called Z-polyhedron or lattice-polyhedron), i.e., a
set of points in a $\mathbb{Z}$ vector space bounded by affine inequalities.

Let us now introduce the representation of a SCoP and its elementary
transformations. A static control part within the syntax tree is a pair
$(\mathcal{S}, \mathbf{i}_{gp})$, where $\mathcal{S}$ is the set of consecutive
statements, in their polyhedral representation, and $\mathbf{i}_{gp}$ is the
vector of global parameters of the SCoP. Vector $\mathbf{i}_{gp}$ is constant
for the SCoP but statically unknown; yet its value is known at runtime, when
entering the SCoP. $d_{gp} = \dim(\mathbf{i}_{gp})$ denotes the number of
global parameters. 

We will use a few specific linear algebra notations: matrices are always
denoted by capital letters, vectors and functions in vector spaces are not;
$\mathrm{pfx}(\mathbf{v}, n)$ returns a length-$n$ prefix of $\mathbf{v}$,
i.e., the vector built from the $n$ first components of $\mathbf{v}$;
$\mathbf{u} \sqsubseteq \mathbf{v}$ is equivalent to $\mathbf{u}$ being a
prefix of $\mathbf{v}$; $\mathbf{1}_k$ denotes the $k$-th unit vector in a
reference base $(\mathbf{1}_1, \ldots, \mathbf{1}_d)$ of a $d$-dimensional
space, i.e., $(0, \ldots, 0, 1, 0, \ldots, 0)$; likewise, $\mathbf{1}_{i,j}$
denotes the matrix filled with zeros but element $(i, j)$ set to 1. 

A SCoP may also be decorated with static properties such as array depen- 
dences or regions, but this work does not address static analysis. 



3.1 Domains, Schedules and Access Functions 

The depth $d^S$ of a statement $S$ is the number of nested loops enclosing $S$
in the SCoP. A statement $S \in \mathcal{S}$ is a quadruple
$(\mathcal{D}^S, \mathcal{L}^S, \mathcal{R}^S, \theta^S)$, where
$\mathcal{D}^S$ is the $d^S$-dimensional iteration domain of $S$,
$\mathcal{L}^S$ and $\mathcal{R}^S$ are sets of polyhedral representations of
array references, and $\theta^S$ is the affine schedule of $S$, defining the
sequential execution ordering of iterations of $S$. To represent arbitrary
lattice polyhedra, each statement is provided with a number $d^S_{lp}$ of local
parameters to implement integer division and modulo operations via affine
projection: e.g., the set of even values of $i$ is described by means of a
local parameter $p$, existentially quantified, and equation $i = 2p$. Let us
describe these concepts in more detail and give some examples.

    do i = 1, N
      A(i) = 0                               (S1)
      do j = 1, M
        A(i) = A(i) + B(i, 2*i+j-N-1)        (S2)
    D(0) = 1                                 (S3)
    do k = 3, N, 2
      D(k) = 2*D(k-2)                        (S4)
      E(k) = -A(k)                           (S5)

Fig. 4. Running example 

$\mathcal{D}^S$ is a convex polyhedron defined by matrix
$\Lambda^S \in \mathcal{M}_{n,\, d^S + d^S_{lp} + d_{gp} + 1}(\mathbb{Z})$ such
that

$$\mathbf{i} \in \mathcal{D}^S \iff \exists\, \mathbf{i}_{lp},\ \Lambda^S\, (\mathbf{i}, \mathbf{i}_{lp}, \mathbf{i}_{gp}, 1)^t \geq 0.$$

Notice the last matrix column is always multiplied by the constant 1; it
corresponds to the homogeneous coordinate encoding of affine inequalities into
linear form. The number $n$ of constraints in $\Lambda^S$ is not limited.
Statements guarded by non-convex conditionals, such as
$0 \le i \le 3 \vee i \ge 8$, are separated into convex domains in the
polyhedral representation. Figure 4 shows an example that illustrates these
definitions. 

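The domains of the five statements are
$\mathcal{D}^{S_1} = \{ i \mid 1 \le i \le N \}$,
$\mathcal{D}^{S_2} = \{ (i,j) \mid 1 \le i \le N, 1 \le j \le M \}$,
$\mathcal{D}^{S_3} = \{ () \}$ (the zero-dimensional vector), and
$\mathcal{D}^{S_4} = \mathcal{D}^{S_5} = \{ k \mid 3 \le k \le N \wedge \exists p, k = 3 + 2p \}$.
E.g., the $\Lambda$-matrices for statements $S_2$ and $S_4$ are

$$\Lambda^{S_2} = \begin{bmatrix} 1 & 0 & 0 & 0 & -1 \\ -1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & -1 \\ 0 & -1 & 0 & 1 & 0 \end{bmatrix}
\quad \text{with} \quad \mathbf{i} = (i,j),\ \mathbf{i}_{lp} = (),\ \mathbf{i}_{gp} = (N,M)$$

$$\Lambda^{S_4} = \begin{bmatrix} 1 & 0 & 0 & 0 & -3 \\ -1 & 0 & 1 & 0 & 0 \\ 1 & -2 & 0 & 0 & -3 \\ -1 & 2 & 0 & 0 & 3 \end{bmatrix}
\quad \text{with} \quad \mathbf{i} = (k),\ \mathbf{i}_{lp} = (p),\ \mathbf{i}_{gp} = (N,M)$$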
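
To make the encoding concrete, the following sketch tests membership in the
domains of S2 and S4 by multiplying each constraint row with the homogeneous
vector (i, i_lp, i_gp, 1). The helper names are hypothetical, and the
existential quantifier on local parameters is handled by a bounded brute-force
search, which is an assumption of this sketch rather than how a polyhedral
library would proceed.

```python
def dot(row, vec):
    """Inner product of a constraint row with a homogeneous point."""
    return sum(a * b for a, b in zip(row, vec))


def in_domain(matrix, it, gp, lp_candidates=((),)):
    """True if some local-parameter vector makes every row non-negative.

    Each row of `matrix` multiplies the column vector (it, lp, gp, 1).
    `lp_candidates` is a finite set of local-parameter vectors to try.
    """
    for lp in lp_candidates:
        point = tuple(it) + tuple(lp) + tuple(gp) + (1,)
        if all(dot(row, point) >= 0 for row in matrix):
            return True
    return False


# Domain matrices of S2 and S4 from the running example.
LAMBDA_S2 = [[1, 0, 0, 0, -1], [-1, 0, 1, 0, 0],
             [0, 1, 0, 0, -1], [0, -1, 0, 1, 0]]
LAMBDA_S4 = [[1, 0, 0, 0, -3], [-1, 0, 1, 0, 0],
             [1, -2, 0, 0, -3], [-1, 2, 0, 0, 3]]

N, M = 9, 4   # arbitrary parameter values chosen for the illustration
s2_points = [(i, j) for i in range(0, N + 2) for j in range(0, M + 2)
             if in_domain(LAMBDA_S2, (i, j), (N, M))]
p_range = [(p,) for p in range(0, N + 1)]
s4_points = [k for k in range(0, N + 2)
             if in_domain(LAMBDA_S4, (k,), (N, M), p_range)]
```

For S4, the last two rows are opposite inequalities and together encode the
equality k = 3 + 2p, so only the odd values of k between 3 and N are kept.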



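$\mathcal{L}^S$ and $\mathcal{R}^S$ describe array references written by $S$
(left-hand side) or read by $S$ (right-hand side), respectively; each is a set
of pairs $(A, f)$ where $A$ is an array variable and $f$ is the access function
mapping iterations in $\mathcal{D}^S$ to locations in $A$. The access function
$f$ is defined by a matrix
$F \in \mathcal{M}_{\dim(A),\, d^S + d^S_{lp} + d_{gp} + 1}(\mathbb{Z})$ such
that

$$f(\mathbf{i}) = F\, (\mathbf{i}, \mathbf{i}_{lp}, \mathbf{i}_{gp}, 1)^t.$$

E.g., $\mathcal{L}^{S_2} = \{ (A, (i)) \}$ and
$\mathcal{R}^{S_2} = \{ (A, (i)), (B, (i, 2i + j - N - 1)^t) \}$, stored as

$$\mathcal{L}^{S_2} : \left\{ \left( A, \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \end{bmatrix} \right) \right\}, \qquad
\mathcal{R}^{S_2} : \left\{ \left( A, \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \end{bmatrix} \right),
\left( B, \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 2 & 1 & -1 & 0 & -1 \end{bmatrix} \right) \right\}$$

with $\mathbf{i} = (i,j)$, $\mathbf{i}_{lp} = ()$, $\mathbf{i}_{gp} = (N,M)$.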
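
As a quick check of this encoding, the sketch below evaluates the access
function of B in S2 at one iteration. The helper name `apply_affine` and the
parameter values are chosen for the illustration only.

```python
def apply_affine(matrix, it, lp, gp):
    """Multiply `matrix` by the homogeneous column vector (it, lp, gp, 1)."""
    point = tuple(it) + tuple(lp) + tuple(gp) + (1,)
    return tuple(sum(a * b for a, b in zip(row, point)) for row in matrix)


# Access matrix of B(i, 2*i+j-N-1) in S2; columns are (i, j, N, M, 1).
F_B = [[1, 0, 0, 0, 0],
       [2, 1, -1, 0, -1]]

N, M = 10, 5                                  # arbitrary parameter values
subscripts = apply_affine(F_B, (4, 3), (), (N, M))   # iteration i=4, j=3
```

The second subscript evaluates 2*4 + 3 - 10 - 1, so the iteration accesses
B(4, 0).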



$\theta^S$ is the affine schedule of $S$; it maps iterations in $\mathcal{D}^S$
to time-stamps (i.e., logical execution dates) in $(2d^S + 1)$-dimensional time
[8]. Multidimensional time-stamps are compared through the lexicographic
ordering over vectors, denoted by $\ll$: iteration $\mathbf{i}$ of $S$ is
executed before iteration $\mathbf{i}'$ of $S'$ if and only if
$\theta^S(\mathbf{i}) \ll \theta^{S'}(\mathbf{i}')$.

To facilitate code generation and to schedule iterations and statements
independently, we need $2d^S + 1$ time dimensions instead of $d^S$ (the minimum
for a sequential schedule). This encoding was first proposed by Feautrier [8]
and used extensively by Kelly and Pugh [13]: dimension $2k$ encodes the
relative ordering of statements at depth $k$ and dimension $2k - 1$ encodes the
ordering of iterations in loops at depth $k$. 

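Finally, $\theta^S$ is defined by a matrix
$\Theta^S \in \mathcal{M}_{2d^S + 1,\, d^S + d_{gp} + 1}(\mathbb{Z})$ such that

$$\theta^S(\mathbf{i}) = \Theta^S\, (\mathbf{i}, \mathbf{i}_{gp}, 1)^t.$$

Notice $\Theta^S$ does not involve local parameters, since lattice polyhedra do
not increase the expressivity of sequential schedules. The schedules for the
previous example are: $\theta^{S_1}(\mathbf{i}) = (0, i, 0)^t$,
$\theta^{S_2}(\mathbf{i}) = (0, i, 1, j, 0)^t$, $\theta^{S_3}(\mathbf{i}) = (1)$,
$\theta^{S_4}(\mathbf{i}) = (2, k, 0)^t$, $\theta^{S_5}(\mathbf{i}) = (2, k, 1)^t$.

E.g., the $\Theta$-matrices for $S_2$ and $S_4$ are:

$$\Theta^{S_2} = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}
\quad \text{with} \quad \mathbf{i} = (i,j),\ \mathbf{i}_{lp} = (),\ \mathbf{i}_{gp} = (N,M)$$

$$\Theta^{S_4} = \begin{bmatrix} 0 & 0 & 0 & 2 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}
\quad \text{with} \quad \mathbf{i} = (k),\ \mathbf{i}_{lp} = (),\ \mathbf{i}_{gp} = (N,M)$$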
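
The role of the time-stamps can be checked on the running example: sorting all
statement instances by their schedule should give back the sequential execution
order. The sketch below does this for small values of N and M; the generator
name and the labels are hypothetical, and Python's built-in tuple comparison is
used as a stand-in for the lexicographic ordering.

```python
def running_example_instances(n, m):
    """Yield (time_stamp, label) for every statement instance of Fig. 4,
    in the order in which the sequential program executes them."""
    for i in range(1, n + 1):
        yield (0, i, 0), ("S1", i)
        for j in range(1, m + 1):
            yield (0, i, 1, j, 0), ("S2", i, j)
    yield (1,), ("S3",)
    for k in range(3, n + 1, 2):
        yield (2, k, 0), ("S4", k)
        yield (2, k, 1), ("S5", k)


instances = list(running_example_instances(3, 2))
original_order = [label for _, label in instances]
# Tuples compare lexicographically, so sorting on the time-stamp
# rebuilds the execution order from the schedules alone.
scheduled_order = [label for _, label in sorted(instances)]
```

Even dimensions interleave statements (S1 before the j loop, S4 before S5),
while odd dimensions carry the loop counters.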



3.2 Invariants 

Our representation makes a clear separation between the semantically
meaningful transformations expressible on the polyhedral representation and the
semantically safe transformations satisfying the statically checkable
properties. The goal is of course to widen the range of meaningful
transformations without relying on the accuracy of a static analyzer. Although
classical transformations are hampered by the lack of information about loop
bounds, they may be feasible in a polyhedral representation separating domains
from affine schedules and allowing per-statement operations. To reach this goal
and to achieve a high degree of transformation compositionality, the
representation enforces a few invariants on the domains and schedules. 

There is only one domain invariant. To avoid integer overflows, the coefficients
in a row of $\Lambda^S$ must be relatively prime:

$$\forall i \in \{1, \ldots, n\},\ \gcd\left(\Lambda^S_{i,1}, \ldots, \Lambda^S_{i,\, d^S + d^S_{lp} + d_{gp} + 1}\right) = 1. \qquad (1)$$

This restriction has no effect on the expressible domains. 
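
Enforcing this invariant amounts to dividing each constraint row by the greatest
common divisor of its coefficients, which leaves the set of integer solutions
unchanged. A minimal sketch, with hypothetical function names:

```python
from functools import reduce
from math import gcd


def row_gcd(row):
    """Greatest common divisor of the absolute values of a row (0 if all zero)."""
    return reduce(gcd, (abs(c) for c in row), 0)


def normalize_row(row):
    """Divide a constraint row by the gcd of its coefficients."""
    g = row_gcd(row)
    return list(row) if g in (0, 1) else [c // g for c in row]


def satisfies_domain_invariant(matrix):
    """True if the coefficients of every row are relatively prime."""
    return all(row_gcd(row) == 1 for row in matrix)


# 2*i - 4 >= 0 describes the same integer points as i - 2 >= 0.
normalized = normalize_row([2, 0, 0, -4])
```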

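The first schedule invariant requires the schedule matrix to fit into a
decomposition amenable to composition and code generation. It separates the
square iteration reordering matrix
$A^S \in \mathcal{M}_{d^S, d^S}(\mathbb{Z})$ operating on iteration vectors,
from the parameterized matrix
$\Gamma^S \in \mathcal{M}_{d^S,\, d_{gp}+1}(\mathbb{Z})$ and from the
statement-scattering vector $\beta^S \in \mathbb{N}^{d^S+1}$:

$$\Theta^S = \begin{bmatrix}
0 & \cdots & 0 & 0 & \cdots & 0 & \beta^S_0 \\
A^S_{1,1} & \cdots & A^S_{1,d^S} & \Gamma^S_{1,1} & \cdots & \Gamma^S_{1,d_{gp}} & \Gamma^S_{1,d_{gp}+1} \\
0 & \cdots & 0 & 0 & \cdots & 0 & \beta^S_1 \\
A^S_{2,1} & \cdots & A^S_{2,d^S} & \Gamma^S_{2,1} & \cdots & \Gamma^S_{2,d_{gp}} & \Gamma^S_{2,d_{gp}+1} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots & \vdots \\
A^S_{d^S,1} & \cdots & A^S_{d^S,d^S} & \Gamma^S_{d^S,1} & \cdots & \Gamma^S_{d^S,d_{gp}} & \Gamma^S_{d^S,d_{gp}+1} \\
0 & \cdots & 0 & 0 & \cdots & 0 & \beta^S_{d^S}
\end{bmatrix} \qquad (2)$$

Statement scattering may not depend on loop counters or parameters, hence the
zeroes in “even dimensions”. Notice $\beta$ subscripts range from 0 to $d^S$. 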

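Back to the running example, matrix $\Theta^{S_2}$ splits into

$$A^{S_2} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad
\Gamma^{S_2} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \qquad
\beta^{S_2} = (0, 1, 0)^t.$$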
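
The decomposition is mechanical and can be sketched as follows. The function
name `split_schedule` is hypothetical; the code simply reads odd rows into A
and Gamma and the last entry of even rows into beta, rejecting matrices whose
even rows depend on counters or parameters.

```python
def split_schedule(theta, depth):
    """Split a (2*depth+1)-row schedule matrix into (A, Gamma, beta).

    Raises ValueError if an even row has a non-zero entry outside its
    last column, i.e. if statement scattering depends on a loop counter
    or a parameter.
    """
    if len(theta) != 2 * depth + 1:
        raise ValueError("expected 2*depth+1 rows")
    a, gamma, beta = [], [], []
    for r, row in enumerate(theta):
        if r % 2 == 0:                       # statement-scattering row
            if any(row[:-1]):
                raise ValueError("even row must be zero except last column")
            beta.append(row[-1])
        else:                                # iteration row
            a.append(list(row[:depth]))
            gamma.append(list(row[depth:]))
    return a, gamma, beta


# Schedule matrix of S2; columns are (i, j, N, M, 1).
THETA_S2 = [[0, 0, 0, 0, 0],
            [1, 0, 0, 0, 0],
            [0, 0, 0, 0, 1],
            [0, 1, 0, 0, 0],
            [0, 0, 0, 0, 0]]
A_S2, GAMMA_S2, BETA_S2 = split_schedule(THETA_S2, depth=2)
```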



The second schedule invariant is the sequentiality one: two distinct statement
iterations may not have the same time-stamp:

$$S \neq S' \vee \mathbf{i} \neq \mathbf{i}' \Rightarrow \theta^S(\mathbf{i}) \neq \theta^{S'}(\mathbf{i}'). \qquad (3)$$

Whether the iterations belong to the domain of $S$ and $S'$ does not matter in
(3): we wish to be able to transform iteration domains without bothering with
the sequentiality of the schedule. Because this invariant is hard to enforce
directly, we introduce two additional invariants with no impact on schedule
expressivity and stronger than (3):

$$|\det(A^S)| = 1, \text{ i.e., } A^S \text{ is unimodular, and } S \neq S' \Rightarrow \beta^S \neq \beta^{S'}. \qquad (4)$$
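
Both conditions of (4) are cheap to verify. The sketch below uses an exact
integer determinant by cofactor expansion, which is adequate for the small
depths reported in Section 2.3; the function names and the dictionary layout
are assumptions made for the example.

```python
def det(m):
    """Exact integer determinant by cofactor expansion (small matrices only).

    The determinant of the empty 0x0 matrix is taken to be 1.
    """
    if not m:
        return 1
    if len(m) == 1:
        return m[0][0]
    total = 0
    for c, coeff in enumerate(m[0]):
        minor = [row[:c] + row[c + 1:] for row in m[1:]]
        total += (-1) ** c * coeff * det(minor)
    return total


def check_schedule_invariants(statements):
    """`statements` maps a name to (A, beta); returns True if (4) holds."""
    betas = [tuple(beta) for _, beta in statements.values()]
    unimodular = all(abs(det(a)) == 1 for a, _ in statements.values())
    return unimodular and len(set(betas)) == len(betas)


# A and beta for the five statements of the running example.
RUNNING_EXAMPLE = {
    "S1": ([[1]], (0, 0)),
    "S2": ([[1, 0], [0, 1]], (0, 1, 0)),
    "S3": ([], (1,)),
    "S4": ([[1]], (2, 0)),
    "S5": ([[1]], (2, 1)),
}
ok = check_schedule_invariants(RUNNING_EXAMPLE)
```

A loop interchange on S2 would replace its A matrix by [[0, 1], [1, 0]], whose
determinant is -1, and so would preserve the invariant.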



Finally, we add a density invariant to avoid integer overflow and ease schedule
comparison. The “odd dimensions” of the image of $\Theta^S$ form a
$d^S$-dimensional sub-space of the multidimensional time, since $A^S$ is
unimodular, but an additional requirement is needed to enforce that “even
dimensions” satisfy some form of dense encoding:

$$\beta^S_k > 0 \Rightarrow \exists S' \in \mathcal{S},\ \mathrm{pfx}(\beta^S, k) = \mathrm{pfx}(\beta^{S'}, k) \wedge \beta^{S'}_k = \beta^S_k - 1, \qquad (5)$$

i.e., for a given prefix, the next dimension of the statement-scattering vectors
spans an interval of non-negative integers. 
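
A direct transcription of (5) is given below as a sketch; `is_dense` is a
hypothetical name, and the quadratic search over all statements is chosen for
clarity rather than efficiency.

```python
def is_dense(betas):
    """Check invariant (5) on a collection of statement-scattering vectors."""
    betas = [tuple(b) for b in betas]
    for beta in betas:
        for k, value in enumerate(beta):
            # A positive component needs a statement with the same prefix
            # whose component at depth k is exactly one less.
            if value > 0 and not any(
                len(other) > k
                and other[:k] == beta[:k]
                and other[k] == value - 1
                for other in betas
            ):
                return False
    return True


# Statement-scattering vectors of S1..S5 in the running example.
EXAMPLE_BETAS = [(0, 0), (0, 1, 0), (1,), (2, 0), (2, 1)]
dense = is_dense(EXAMPLE_BETAS)
# Dropping S3 leaves a hole at position 1 of the outermost dimension.
dense_without_s3 = is_dense([(0, 0), (0, 1, 0), (2, 0), (2, 1)])
```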



3.3 Constructors 

We define some elementary functions on SCoPs, called constructors. Many matrix
operations consist in adding or removing a row or column. Given a vector $v$
and a matrix $M$ with $\dim(v)$ columns and at least $i$ rows, AddRow(M, i, v) inserts
a new row at position $i$ in $M$ and fills it with the value of vector $v$, whereas
RemRow(M, i) does the opposite transformation. Analogous constructors exist
for columns: AddCol(M, j, v) inserts a new column at position $j$ in $M$ and fills it
with vector $v$, whereas RemCol(M, j) undoes the insertion. AddRow and RemRow
are extended to operate on vectors.
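The four constructors can be sketched on matrices stored as lists of rows. This is only an illustration, not the authors' implementation: the Python function names and the zero-based positions are assumptions made for the example.

```python
def add_row(m, i, v):
    """Return a copy of matrix m (a list of rows) with row v inserted at index i."""
    if m and len(v) != len(m[0]):
        raise ValueError("v must have as many entries as m has columns")
    return [list(r) for r in m[:i]] + [list(v)] + [list(r) for r in m[i:]]


def rem_row(m, i):
    """Return a copy of m without the row at index i (undoes add_row)."""
    return [list(r) for k, r in enumerate(m) if k != i]


def add_col(m, j, v):
    """Return a copy of m with column v inserted at index j."""
    if len(v) != len(m):
        raise ValueError("v must have as many entries as m has rows")
    return [list(r[:j]) + [x] + list(r[j:]) for r, x in zip(m, v)]


def rem_col(m, j):
    """Return a copy of m without the column at index j (undoes add_col)."""
    return [list(r[:j]) + list(r[j + 1:]) for r in m]
```

All four functions return new matrices, so an insertion followed by the matching removal gives back the original matrix.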

Displacement of a statement $S$ is also a common operation. It only impacts
the statement-scattering vector $\beta^{S'}$ of some statements $S'$ sharing some
common property with $S$. Indeed, forward or backward movement of $S$ at depth $\ell$
triggers the same movement on every subsequent statement $S'$ at depth $\ell$ such
that $\mathrm{pfx}(\beta^{S'}, \ell) = \mathrm{pfx}(\beta^S, \ell)$. Although rather intuitive, the following definition
with prefixed blocks of statements is rather technical. Consider a SCoP $\mathcal{S}$, a
statement-scattering prefix $P$ defining the depth at which statements should be
displaced, a statement-scattering prefix $Q$ (prefixed by $P$) marking the initial
time-stamp of statements to be displaced, and a displacement distance $o$; $o$
is the value to be added to (or subtracted from) the component at depth $\dim(P)$ of any
statement-scattering vector $\beta^S$ prefixed by $P$ and following $Q$. The displacement
constructor Move(P, Q, o) leaves all statements unchanged except those satisfying

$$\forall S \in \mathcal{S},\ P \sqsubseteq \beta^S \wedge (Q \ll \beta^S \vee Q \sqsubseteq \beta^S) : \beta^S_{\dim(P)} \leftarrow \beta^S_{\dim(P)} + o. \qquad (6)$$

Constructors make no assumption about representation invariants and may
violate them.
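The effect of Move on statement-scattering vectors can be sketched as follows. This is a minimal illustration under stated assumptions: a SCoP is reduced to a dictionary from (hypothetical) statement names to $\beta$ vectors stored as tuples, $\sqsubseteq$ is read as "is a prefix of", and $\ll$ as strict lexicographic order.

```python
def is_prefix(p, beta):
    """True if vector p is a prefix of vector beta."""
    return len(p) <= len(beta) and tuple(beta[:len(p)]) == tuple(p)


def move(betas, P, Q, o):
    """Apply Move(P, Q, o) to a dict {statement name: beta tuple}.

    Every beta prefixed by P that is prefixed by Q or lexicographically
    after Q gets its component at depth dim(P) increased by o.
    """
    P, Q = tuple(P), tuple(Q)
    depth = len(P)
    moved = {}
    for name, beta in betas.items():
        beta = tuple(beta)
        follows_q = Q < beta or is_prefix(Q, beta)
        if is_prefix(P, beta) and len(beta) > depth and follows_q:
            beta = beta[:depth] + (beta[depth] + o,) + beta[depth + 1:]
        moved[name] = beta
    return moved
```

For instance, with three statements at $\beta$ = (0, 0), (0, 1) and (1, 0), `move(betas, (), (1,), 1)` only delays the third one, leaving a free slot at position 1 of the outermost level.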

3.4 Primitives 

From the earlier constructors, we will now define transformation primitives that
enforce the invariants and serve as building blocks for higher level, semantically
sound transformations. Most primitives correspond to simple polyhedral
operations, but their formal definition is rather technical and will be described
more extensively in a further paper. Figure 5 lists the main primitives affecting
the polyhedral representation of a statement.$^1$ $U$ denotes a unimodular matrix;
$M$ implements the parameterized shift (or translation) of the affine schedule
of a statement; $\ell$ denotes the depth of a statement insertion, iteration domain
extension or restriction; and $c$ is a vector implementing an additional domain
constraint.

$^1$ Many of these primitives can be extended to blocks of statements sharing a common
statement-scattering prefix (like the fusion and split primitives).
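| Syntax | Name | Prerequisites | Effect |
|---|---|---|---|
| LeftU(S, U) | Unimodular | $S \in \mathcal{S} \wedge U \in \mathcal{M}_{d^S,d^S}(\mathbb{Z}) \wedge \lvert\det(U)\rvert = 1$ | $A^S \leftarrow U.A^S$ |
| RightU(S, U) | Unimodular | $S \in \mathcal{S} \wedge U \in \mathcal{M}_{d^S,d^S}(\mathbb{Z}) \wedge \lvert\det(U)\rvert = 1$ | $A^S \leftarrow A^S.U$ |
| Shift(S, M) | Shift | $S \in \mathcal{S} \wedge M \in \mathcal{M}_{d^S,d_{gp}+1}(\mathbb{Z})$ | $\Gamma^S \leftarrow \Gamma^S + M$ |
| Insert(S, $\ell$) | Insertion | $\ell \le d^S \wedge \beta^S_{\ell+1} = \cdots = \beta^S_{d^S} = 0 \wedge (\exists S' \in \mathcal{S},\ \mathrm{pfx}(\beta^S, \ell+1) \sqsubseteq \beta^{S'} \vee (\mathrm{pfx}(\beta^S, \ell), \beta^S_\ell - 1) \sqsubseteq \beta^{S'})$ | $P = \mathrm{pfx}(\beta^S, \ell)$; $\mathcal{S} \leftarrow \mathrm{Move}(P, (P, \beta^S_\ell), 1) \cup S$ |
| Delete(S) | Deletion | $S \in \mathcal{S}$ | $P = \mathrm{pfx}(\beta^S, d^S)$; $\mathcal{S} \leftarrow \mathrm{Move}(P, (P, \beta^S_{d^S}), -1) \setminus S$ |
| Extend(S, $\ell$) | Extension | $S \in \mathcal{S}$ | $d^S \leftarrow d^S + 1$; $\Lambda^S \leftarrow \mathrm{AddCol}(\Lambda^S, \ell, 0)$; $A^S \leftarrow \mathrm{AddRow}(\mathrm{AddCol}(A^S, \ell, 0), \ell, 1_\ell)$; $\beta^S \leftarrow \mathrm{AddRow}(\beta^S, \ell, 0)$; $\Gamma^S \leftarrow \mathrm{AddRow}(\Gamma^S, \ell, 0)$; $\forall (A, F) \in \mathcal{L}^S \cup \mathcal{R}^S,\ F \leftarrow \mathrm{AddRow}(F, \ell, 0)$ |
| Restrict(S, $\ell$) | Restriction | $S \in \mathcal{S}$ | $d^S \leftarrow d^S - 1$; $\Lambda^S \leftarrow \mathrm{RemCol}(\Lambda^S, \ell)$; $A^S \leftarrow \mathrm{RemRow}(\mathrm{RemCol}(A^S, \ell), \ell)$; $\beta^S \leftarrow \mathrm{RemRow}(\beta^S, \ell)$; $\Gamma^S \leftarrow \mathrm{RemRow}(\Gamma^S, \ell)$; $\forall (A, F) \in \mathcal{L}^S \cup \mathcal{R}^S,\ F \leftarrow \mathrm{RemRow}(F, \ell)$ |
| CutDomain(S, c) | Cut Domain | $S \in \mathcal{S} \wedge \dim(c) = d^S + d^S_{lp} + d_{gp} + 1$ | $\Lambda^S \leftarrow \mathrm{AddRow}(\Lambda^S, 0, c / \gcd(c_1, \ldots, c_{d^S + d^S_{lp} + d_{gp} + 1}))$ |
| AddLP(S) | Add Local Parameter | $S \in \mathcal{S}$ | $d^S_{lp} \leftarrow d^S_{lp} + 1$; $\Lambda^S \leftarrow \mathrm{AddCol}(\Lambda^S, d^S + 1, 0)$; $\forall (A, F) \in \mathcal{L}^S \cup \mathcal{R}^S,\ F \leftarrow \mathrm{AddCol}(F, d^S + 1, 0)$ |
| Fuse(P, o) | Fusion | | $b = \max\{\beta^S_{\dim(P)+1} \mid (P, o) \sqsubseteq \beta^S\} + 1$; $\mathrm{Move}((P, o+1), (P, o+1), b)$; $\mathrm{Move}(P, (P, o+1), -1)$ |
| Split(P, o, b) | Split | | $\mathrm{Move}(P, (P, o, b), 1)$; $\mathrm{Move}((P, o+1), (P, o+1), -b)$ |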

Fig. 5. Main transformation primitives 



The last two primitives — fusion and split (or distribution) — show the 
benefit of designing loop transformations at the abstract semantical level of 
polyhedra. First of all, loop bounds are not an issue since the code generator 
will handle any overlapping of iteration domains. Next, these primitives do not 
directly operate on loops, but consider prefixes P of statement-scattering vectors. 
As a result, they may virtually be composed with any possible transformation. 
For the split primitive, vector (P, o) prefixes all statements concerned by the split; 
and parameter b indicates the position where statement delaying should occur. 
For the fusion primitive, vector (P, o + 1) prefixes all statements that should 
be interleaved with statements prefixed by (P, o). Finally, notice that fusion
followed by split (with the appropriate value of b) leaves the SCoP unchanged.
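The round-trip property of fusion and split can be checked on a small sketch. It is an illustration only, under the same assumptions as before: a SCoP is reduced to a dictionary from hypothetical statement names to $\beta$ tuples, and the helper names are not taken from the paper.

```python
def move(betas, P, Q, o):
    """Move(P, Q, o) on {name: beta tuple}: shift component dim(P) of every
    beta prefixed by P that is prefixed by Q or lexicographically after Q."""
    P, Q, d = tuple(P), tuple(Q), len(P)
    out = {}
    for name, beta in betas.items():
        beta = tuple(beta)
        after_q = Q < beta or beta[:len(Q)] == Q
        if beta[:d] == P and len(beta) > d and after_q:
            beta = beta[:d] + (beta[d] + o,) + beta[d + 1:]
        out[name] = beta
    return out


def fuse(betas, P, o):
    """Fuse(P, o): append the block prefixed by (P, o+1) to the block (P, o).
    Returns the new betas and the offset b needed to undo the fusion."""
    P, d = tuple(P), len(P)
    b = max(beta[d + 1] for beta in betas.values()
            if beta[:d + 1] == P + (o,) and len(beta) > d + 1) + 1
    betas = move(betas, P + (o + 1,), P + (o + 1,), b)
    return move(betas, P, P + (o + 1,), -1), b


def split(betas, P, o, b):
    """Split(P, o, b): statements of block (P, o) from position b onwards
    are moved to a new block (P, o+1)."""
    P = tuple(P)
    betas = move(betas, P, P + (o, b), 1)
    return move(betas, P + (o + 1,), P + (o + 1,), -b)
```

On four statements at (0, 0), (0, 1), (1, 0) and (2, 0), `fuse(betas, (), 0)` yields (0, 0), (0, 1), (0, 2), (1, 0) with b = 2, and `split` with that value of b restores the initial vectors.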

This table is not complete: privatization, array contraction and copy propa- 
gation require operations on access functions. 

3.5 Transformation Composition 

We will illustrate the composition of primitives on a typical example: two- 
dimensional tiling. To define such a composed transformation, we first build 
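| Syntax | Name | Prerequisites | Effect | Comments |
|---|---|---|---|---|
| Interchange(S, o) | Loop Interchange | $S \in \mathcal{S} \wedge o < d^S$ | $U = I_{d^S} - 1_{o,o} - 1_{o+1,o+1} + 1_{o,o+1} + 1_{o+1,o}$; $S \leftarrow \mathrm{RightU}(S, U)$ | swap rows $o$ and $o+1$ |
| StripMine(S, o, k) | Strip Mining | $S \in \mathcal{S} \wedge o < d^S \wedge k > 0$ | $S \leftarrow \mathrm{Extend}(S, o)$ | |
| | | | $S \leftarrow \mathrm{AddLP}(S)$ | |
| | | | $p = d^S + 1$ | local param. column |
| | | | $u = d^S + d^S_{lp} + d_{gp} + 1$ | constant column |
| | | | $S \leftarrow \mathrm{CutDomain}(S, 1_{o+1} - 1_o)$ | $(i_o \le i_{o+1})$ |
| | | | $S \leftarrow \mathrm{CutDomain}(S, 1_o - 1_{o+1} + (k-1) 1_u)$ | $(i_{o+1} \le i_o + k - 1)$ |
| | | | $S \leftarrow \mathrm{CutDomain}(S, 1_o - k\, 1_p)$ | $(k \times p \le i_o)$ |
| | | | $S \leftarrow \mathrm{CutDomain}(S, k\, 1_p - 1_o)$ | $(i_o \le k \times p)$ |
| Tile(S, o, k) | Tiling | $S \in \mathcal{S} \wedge o < d^S \wedge k > 0$ | $S \leftarrow \mathrm{StripMine}(S, o, k)$; $S \leftarrow \mathrm{StripMine}(S, o+2, k)$; $S \leftarrow \mathrm{Interchange}(S, o+1)$ | |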

Fig. 6. Composition of transformation primitives 



the strip-mining and interchange transformations from the primitives, as shown 
in Figure 6. 

Interchange(S, o) swaps the roles of $i_o$ and $i_{o+1}$ in the schedule of $S$; it
is a per-statement extension of the classical interchange. StripMine(S, o, k),
where $k$ is a known integer, prepends a new iterator to virtually $k$-times unroll
the schedule and iteration domain of $S$ at depth $o$. Finally, Tile(S, o, k) tiles
the loops at depth $o$ and $o + 1$ with $k \times k$ blocks.

This tiling transformation is a first step towards a higher-level combined 
transformation, integrating strip-mining and interchange with privatization, ar- 
ray copy propagation and hoisting for dependence removal. The only remaining 
parameters would be the statements and loops of interest and the tile size. 
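At the loop level, the composition used by Tile (strip-mine both loops, then interchange the two middle loops) can be illustrated on a rectangular nest. The sketch below works on explicit Python loops rather than on the polyhedral representation; the function names and the zero-based rectangular domain are assumptions of the example.

```python
def original_nest(n, m):
    """Iterations (i, j) of a rectangular two-deep nest, in execution order."""
    return [(i, j) for i in range(n) for j in range(m)]


def tiled_nest(n, m, k):
    """Same iterations after strip-mining both loops by k and interchanging
    the point loop on i with the tile loop on j (k x k tiling)."""
    order = []
    for ii in range(0, n, k):                        # tile loop on i
        for jj in range(0, m, k):                    # tile loop on j
            for i in range(ii, min(ii + k, n)):      # point loop on i
                for j in range(jj, min(jj + k, m)):  # point loop on j
                    order.append((i, j))
    return order
```

Both functions enumerate exactly the same iterations; only the order differs, which is what the schedule transformation expresses.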

4 Code Generation 

After polyhedral transformations, code generation is the last step to the final
program. It is often ignored in spite of its impact on the target code quality. In
particular, we must ensure that poor control management does not spoil
performance, for instance by producing redundant guards or complex loop bounds.

Ancourt and Irigoin [1] proposed the first solution, based on the Fourier- 
Motzkin pair-wise elimination. The scope of their method was limited to a single 
polyhedron with unimodular transformation (scheduling) matrices. The basic
idea was to apply the transformation function as a change of basis of the loop
indices, then, for each new dimension, to project the polyhedron on the axis and
thus find the corresponding loop bounds. The main drawback of this method was
the large amount of redundant control. Most further works on code generation 
tried to extend this first technique, in order to deal with non-unit strides [15, 
21] or with a non-invertible transformation matrix [9]. A few alternatives to 
Fourier-Motzkin were discussed, but without addressing the challenging problem 
of scanning more than one polyhedron at once. 
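One Fourier-Motzkin elimination step, the projection used to derive loop bounds, can be sketched as follows. This is a textbook version written for illustration, not the code of any of the cited tools; the constraint encoding (a pair of integer coefficients and a constant, meaning "linear form plus constant is non-negative") is an assumption of the example, and no redundancy removal is attempted.

```python
from itertools import product


def eliminate(constraints, var):
    """Project out variable number `var` by Fourier-Motzkin elimination.

    Each constraint is (coeffs, const) and stands for
    sum(coeffs[k] * x[k]) + const >= 0.
    """
    lower, upper, rest = [], [], []
    for a, c in constraints:
        if a[var] > 0:
            lower.append((a, c))      # yields a lower bound on x[var]
        elif a[var] < 0:
            upper.append((a, c))      # yields an upper bound on x[var]
        else:
            rest.append((a, c))       # does not involve x[var]
    for (al, cl), (au, cu) in product(lower, upper):
        sl, su = -au[var], al[var]    # positive multipliers cancelling x[var]
        coeffs = tuple(sl * x + su * y for x, y in zip(al, au))
        rest.append((coeffs, sl * cl + su * cu))
    return rest


def satisfies(constraints, point):
    """True if the integer point satisfies every constraint."""
    return all(sum(a * x for a, x in zip(coeffs, point)) + c >= 0
               for coeffs, c in constraints)
```

For the triangle 1 ≤ i ≤ j ≤ 5, encoded with variables (i, j), eliminating j leaves the constraints 1 ≤ i and i ≤ 5, which are the bounds of the outer loop. The pairwise combination of lower and upper bounds is also where the redundant control mentioned above comes from.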

This problem was first solved and implemented in Omega by generating
a naive perfectly nested code and then by (partially) eliminating redundant
guards [14]. Another way was to generate the code for each polyhedron separately,
and then to merge them [9, 5]; it generates a lot of redundant control,
even if there were no redundancies in the separated code. Quilleré et al. proposed
to recursively separate unions of polyhedra into subsets of disjoint polyhedra
and to generate the corresponding nests from the outermost to the innermost
levels [19]. This approach currently provides the best solutions since it totally
eliminates redundant control. However, it suffers from some limitations, e.g. high
complexity, code generation with unit strides only, and a rigid partial order on
the polyhedra. Improvements are presented in the next section.

This section presents the code generation problem, its resolution with a mod- 
ern polyhedral-scanning technique, and its implementation. 

4.1 The Code Generation Problem 

In the polyhedral model, code generation amounts to a polyhedron scanning
problem: finding a set of nested loops visiting each integral point, following a given
scanning order. The generated code quality can be assessed by using two criteria:
the most important is the amount of duplicated control in the final
code; the second is the code size, since a large code may pollute the instruction cache.
We choose the recent Quilleré et al. method [19] with some additional improvements,
which guarantee a code generation without any duplicated control. The
outline of the modified algorithm is presented in Section 4.2 and some useful
optimizations are discussed in Section 4.3.

4.2 Outline of the Code Generation Algorithm 

Our code generation process is divided into two main steps. First, we take the
scheduling functions into account by modifying each polyhedron’s lexicographic
order. Next, we use an improved Quilleré et al. algorithm to perform the actual
code generation.

When no schedule is specified, the scanning order is the plain lexicographic
order. Applying a new scanning order to a polyhedron amounts to adding new
dimensions in leading positions. Thus, from each polyhedron $\mathcal{D}^S$ and scheduling
function $\theta^S$, we build another polyhedron $\mathcal{T}^S$ with the desired lexicographic
order: $(t, i) \in \mathcal{T}^S$ if and only if $t = \theta^S(i)$. The algorithm is a recursive generation
of the scanning code, maintaining a list of polyhedra from the outermost to the
innermost loops:

1. intersect each polyhedron of the list with the context of the current loop (to 
restrict the scanning code to this loop); 

2. project the resulting polyhedra onto the outermost dimensions, then separate 
the projections into disjoint polyhedra; 

3. sort the resulting polyhedra such that a polyhedron is before another one if 
its scanning code has to precede the other to respect the lexicographic order; 

4. merge successive polyhedra having at least another loop level to generate 
a new list and recursively generate the loops that scan this list; 

5. compute the strides that the current dimension imposes on the outer
dimensions.

This algorithm is slightly different from the one presented by
Quilleré et al. in [19]; our two main contributions are the support for non-unit
strides (Step 5) and the exploitation of degrees of freedom (i.e., when some
operations do not have a schedule) to produce a more effective code (Step 4).

Let us describe this algorithm with a non-trivial example: the two polyhedral
domains presented in Figure 7(a). Both statements have iteration vector $(i, j)$,
local parameter vector $(k)$ and global parameter vector $(n)$. We first compute
intersections with the context, supposed to be $n \ge 6$. We project the polyhedra
onto the first dimension, $i$, then separate them into disjoint polyhedra. Thus we
compute the domains associated with $\mathcal{T}^{S_1}$ alone, both $\mathcal{T}^{S_1}$ and $\mathcal{T}^{S_2}$, and $\mathcal{T}^{S_2}$
alone (as shown in Figure 7(b), this last domain is empty). We notice there is
a local parameter implying a non-unit stride; we can determine this stride and
update the lower bound. We finally generate the scanning code for this first
dimension. We now recurse on the next dimension, repeating the process for
each polyhedron list (in this example, there are now two lists: one inside each
generated outer loop). We intersect each polyhedron with the new context, now
the outer loop iteration domains; then we project the resulting polyhedra on the
outer dimensions, and finally we separate these projections into disjoint polyhedra.
This last processing is trivial for the second list but yields two domains for
the first list, as shown in Figure 7(c). Finally, we generate the code associated
with the new dimension.
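The result of this example can be cross-checked by brute force: the loop nest of Figure 7(c) must execute exactly the statement instances described by the two domains of Figure 7(a), each one once. The sketch below does this enumeration for a fixed value of n; the function names are illustrative and the check is not part of the code generator itself.

```python
def domain_points(n):
    """Statement instances (name, i, j) described by the domains of Figure 7(a)."""
    points = set()
    for i in range(1, n + 1, 2):            # S1: 1 <= i <= n, i = 2k + 1
        for j in range(1, n + 1):           #     1 <= j <= n
            points.add(("S1", i, j))
    for i in range(1, 7, 2):                # S2: 1 <= i <= 6, i = 2k + 1
        for j in range(1, 7 - i + 1):       #     1 <= j <= 7 - i
            points.add(("S2", i, j))
    return points


def generated_scan(n):
    """Execution trace of the loop nest of Figure 7(c), in program order."""
    trace = []
    for i in range(1, 7, 2):                # do i = 1, 6, 2
        for j in range(1, 7 - i + 1):       #   do j = 1, 7-i
            trace.append(("S1", i, j))
            trace.append(("S2", i, j))
        for j in range(8 - i, n + 1):       #   do j = 8-i, n
            trace.append(("S1", i, j))
    for i in range(7, n + 1, 2):            # do i = 7, n, 2
        for j in range(1, n + 1):           #   do j = 1, n
            trace.append(("S1", i, j))
    return trace
```

For every n in the context, the trace contains no duplicate and covers the two domains, and no guard is needed inside the loops.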



[Figure 7: plot of the iteration domains of S1 and S2 in the (i, j) plane, with
the domain constraints and the code produced at each step.]

(a) Initial domains to scan

    $\mathcal{T}^{S_1}(n) : \{ 1 \le i \le n,\ i = 2k + 1,\ 1 \le j \le n \}$
    $\mathcal{T}^{S_2}(n) : \{ 1 \le i \le 6,\ i = 2k + 1,\ 1 \le j \le 7 - i \}$

(b) Projection and separation on the first dimension

    do i = 1, 6, 2
      $\mathcal{T}^{S_1}(n) : \{ 1 \le j \le n \}$
      $\mathcal{T}^{S_2}(n) : \{ 1 \le j \le 7 - i \}$
    do i = 7, n, 2
      $\mathcal{T}^{S_1}(n) : \{ 1 \le j \le n \}$

(c) Recursion on next dimensions

    do i = 1, 6, 2
      do j = 1, 7-i
        S1; S2
      do j = 8-i, n
        S1
    do i = 7, n, 2
      do j = 1, n
        S1

Fig. 7. Step by step code generation example
4.3 Complexity Issues

The main computing kernel in the code generation process is the separation
into disjoint polyhedra, with a worst-case $O(3^n)$ complexity in polyhedral
operations (exponential themselves). In addition, the memory usage is very high
since we have to allocate memory for each separated domain. For both issues,
we propose a partial solution. First of all, we use pattern matching to reduce
the number of polyhedral computations: at a given depth, the domains are often
the same (this is a property of the input codes), or disjoint (this is a property
of the statement-scattering vectors of the scheduling matrices). Second, to avoid
memory problems, we detect high memory consumption and switch to a more
naive algorithm when necessary, leading to a less efficient code but using far less
memory.

Our implementation of this algorithm is called CLooG (Chunky Loop Generator)
and was originally designed for a locality-improvement algorithm and
software (Chunky) [3]. CLooG could regenerate code for all 12 benchmarks in
Figure 2. Experiments were conducted on a 512 MB 1 GHz Pentium III machine;
generation times range from 1 to 127 seconds (34 seconds on average).
It produced optimal control for all but three SCoPs in lucas, apsi and adm; the
first SCoP has more than 1700 statements and could be optimally generated on
a 1 GB Itanium machine in 22 minutes; the two other SCoPs have fewer than 50
statements, but 16 parameters; since the current version of CLooG does not analyse the
linear relations between variables, the variability of parameter interactions leads
to an exponential growth of the generated code. Complexity improvements and
studies of the generated code quality are under investigation.



5 WRaP-IT: An Open64 Plug-In for Polyhedral 
Transformations 

Our main goal is to streamline the extraction of static control parts and the
code generation, to ease the integration of polyhedral techniques into optimizing
and parallelizing compilers. This interface tool is built on Open64/ORC. It
converts the WHIRL (the compiler’s hierarchical intermediate representation)
to an augmented polyhedral representation, maintaining a correspondence
between matrices in SCoP descriptions and the symbol table and syntax tree.
This representation is called the WRaP: WHIRL Represented as Polyhedra. It
is the basis for any polyhedral analysis or transformation. Then, the second part
of the tool is a modified version of CLooG, to regenerate a WHIRL syntax tree
from the WRaP. The whole Interface Tool is called WRaP-IT; it may be used
in a normal compilation or source-to-source framework, see [2] for details.

Although WRaP-IT is still a prototype, it proved to be very robust; the
whole source-to-polyhedra-to-source transformation was successfully applied to
all 12 benchmarks in Figure 2. See http://www-rocq.inria.fr/a3/wrap-it for
further information.
6 Conclusion

We described a framework to streamline the design of polyhedral transforma- 
tions, based on a unified polyhedral representation and a set of transformation 
primitives. It decouples transformations from static analyses. It is intended as
a formal tool for semi-automatic optimization, where program transformations
(with the associated static analyses for semantic preservation) are separated
from the optimization or parallelization algorithm which drives the transformations
and selects their parameters.

We also described WRaP-IT, a robust tool to convert back and forth be- 
tween Fortran or C and the polyhedral representation. This tool is implemented 
in Open64/ORC. The complexity of the code generation phase, when convert- 
ing back to source code, has long been a deterrent for using polyhedral repre- 
sentations in optimizing or parallelizing compilers. However, our code generator 
(CLooG) can handle loops with more than 1700 statements. Moreover, the whole 
source-to-polyhedra-to-source transformation was successfully applied to the 12 
benchmarks. This is a strong point in favor of polyhedral techniques, even in the 
context of real codes. 

Current and future work includes the design and implementation of a polyhedral
transformation library, an iterative compilation scheme with a machine-learning
algorithm and/or an empirical optimization methodology, and the optimization
of the code generator to keep producing optimal code on larger codes.
Abstract. In this paper, we present a technique to perform dependence
analysis on more complex array subscripts than the linear form of the 
enclosing loop indices. For such complex array subscripts, we decouple 
the original iteration space and the dependence test iteration space and 
link them through index-association functions. Dependence analysis is 
performed in the dependence test iteration space to determine whether 
the dependence exists in the original iteration space. The dependence dis- 
tance in the original iteration space is determined by the distance in the 
dependence test iteration space and the property of index-association 
functions. For certain non-linear expressions, we show how to equiva- 
lently transform them to a set of linear expressions. The latter can be 
used in traditional dependence analysis techniques targeting subscripts 
which are linear forms of enclosing loop indices. We also show how our 
advanced dependence analysis technique can help parallelize some oth- 
erwise hard-to-parallelize loops. 



1 Introduction 

Multiprocessor and multi-core microprocessor machines demand good automatic 
parallelization to utilize precious machine resources. Accurate dependence analysis
is essential for effective automatic parallelization.

Traditional dependence analysis only considers array subscripts which are 
linear functions of the enclosing loop indices [6, 8, 13]. Various techniques, from 
a simple one like the GCD test to a complex one like the Fourier-Motzkin test, 
are applied to determine whether two array references could access the same 
memory location. For more complex subscripts, these techniques often consider 
them too complex and will give up with the assumption that a dependence exists. 
Figure 1(a) shows a simple example, which these traditional techniques are not
able to parallelize because they make the worst-case assumption. (In this paper,
programs are written in Fortran.)

This paper tries to overcome this conservatism. We apply a decoupled approach
where a new dependence test iteration space is constructed for dependence
testing purposes. The original iteration space is linked to the dependence
test iteration space by the mapping through index-association functions. We call 
our approach index-association based dependence analysis. Dependence analysis 
(a)

    DO I = L, U
      J = MOD(I + C1, C2) + C3
      A(J) = ... (no references to A) ...
    END DO

(b)

    IF (C2 > (U - L + 1)) THEN
      ! The following loop is DOALL
      DO I = L, U
        J = MOD(I + C1, C2) + C3
        A(J) = ... (no references to A) ...
      END DO
    ELSE
      ! The following loop is not DOALL
      DO I = L, U
        J = MOD(I + C1, C2) + C3
        A(J) = ... (no references to A) ...
      END DO
    END IF

Fig. 1. Example 1



is performed under the dependence test iteration space. Whether the depen- 
dence exists in the original iteration space is determined by whether the depen- 
dence exists in the dependence test iteration space. If the dependence exists, 
the dependence distance in the original iteration space is determined by the de- 
pendence distance in the dependence test iteration space and the property of 
index-association functions. 

We also present a general approach to equivalently transform a non-linear
expression, involving plus, minus, multiplication and division, to a set of linear
expressions. The latter can be used in dependence testing with traditional
techniques.

When performing traditional dependence analysis and analyzing the index- 
association functions, our dependence analysis framework is also able to generate 
certain conditions under which cross-iteration dependence does not exist in the 
original iteration space. Such a condition can often be used as a run-time test for 
parallelization vs. serialization of the target loop. With the combination of index- 
association based dependence analysis and such two-version code parallelization, 
the code in Figure 1(a) can be parallelized as the code in Figure 1(b). 
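The run-time condition of Figure 1(b) can be checked by brute force on small values. The sketch below is an illustration only: the function names are hypothetical, the values are restricted to non-negative operands so that Python's `%` agrees with Fortran's MOD, and the condition is taken exactly as printed in Figure 1(b).

```python
def subscripts(lo, up, c1, c2, c3):
    """Values of J = MOD(I + C1, C2) + C3 for I = lo, ..., up."""
    return [(i + c1) % c2 + c3 for i in range(lo, up + 1)]


def is_doall(lo, up, c1, c2, c3):
    """True if no two iterations write the same element A(J)."""
    js = subscripts(lo, up, c1, c2, c3)
    return len(set(js)) == len(js)


def runtime_test(lo, up, c2):
    """Condition guarding the parallel version in Figure 1(b)."""
    return c2 > (up - lo + 1)
```

Whenever `runtime_test` holds, the loop runs over fewer than C2 consecutive values of I, so the modulo cannot wrap around onto an already written element and `is_doall` holds as well. The converse is not claimed: the test is only a sufficient condition.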

We have implemented the index-association based dependence analysis in
our production compiler. Before this implementation, our compiler already
implemented several dependence tests targeting subscripts which are linear
functions of enclosing loop indices, which already enabled it to parallelize many
loops. With this new implementation, our compiler is able to parallelize some
loops that it could not parallelize before. We select two
well-known benchmarks from the SPEC CPU2000 suite. With our technique, several
important loops inside these two benchmarks can be parallelized successfully.

In the rest of the paper, we describe the previous work in Section 2. We 
present a program model in Section 3. We then describe our index-association 
based dependence analysis in Section 4. We present how to transform a non-linear 
expression to a set of linear expressions in Section 5. We show how our advanced 
dependence analysis helps automatic parallelization in Section 6. We present 
experimental results in Section 7. Finally, a conclusion is drawn in Section 8. 
    DO I1 = L1, U1, S1
      DO I2 = L2, U2, S2
        DO In = Ln, Un, Sn
          J1 = f1(I1, ..., In)
          J2 = f2(I1, ..., In)
          ...
          Jm = fm(I1, ..., In)
          ... using linear form of (J1, ..., Jm) in array subscripts
        END DO
      END DO
    END DO

Fig. 2. Program Model



2 Previous Work 

Dependence analysis has been studied extensively. Maydan et al. use a series of
special case dependence tests with the hope that they can catch the majority of
cases in practice [6]. They use an expensive integer programming method as the
backup in case all these special tests fail to determine whether the dependence
exists or not. Goff et al. present a practical dependence testing scheme that classifies
subscripts into different categories, where different dependence tests are used in
different categories [3]. Pugh presents an integer programming method for exact
dependence analysis with worst-case exponential complexity in terms of loop levels
and the number of array dimensions [8]. Feautrier analyzes dependences using
parametric integer programming [2]. His technique takes statement context into
consideration so that the dependence test result is more accurate. All the above
techniques focus on array subscripts which are linear functions of enclosing loop
indices.

Dependence analysis with array subscripts which are not linear functions of 
enclosing loop indices has also been studied. Blume and Eigenmann propose
the range test, where the range of symbolic expressions is evaluated against the loop
index value [1]. The loop can be parallelized if the range of elements accessed 
in one iteration does not overlap with the range of the other elements in other 
iterations. Haghighat and Polychronopoulos handle non-linear subscripts by us- 
ing their mathematical properties [4]. They use symbolic analysis and constraint 
propagation to help achieve a mathematically easy-to-compare form for the sub- 
scripts. Hoeflinger and Paek present an access region dependence test [5]. They 
perform array region analysis and determine dependence based on whether array 
regions overlap with each other or not. All these works are complementary to 
our work and can be used in our work as our dependence test iteration space 
can be extended to include more complex subscripts. 

3 Program Model 

Figure 2 illustrates our program model. Our target loop nest is an n-level perfect
nest where $n \ge 1$. The loop lower bound $L_k$ ($1 \le k \le n$) and loop upper bound $U_k$
are linear functions of the enclosing loop indices $I_p$ ($1 \le p \le k - 1$). The loop
steps $S_k$ ($1 \le k \le n$) are loop nest invariants.

In the beginning of the innermost loop body, we have $m$ ($m \ge 1$) functions
which map a set of values $(I_1, \ldots, I_n)$ to a new set of values $(J_1, \ldots, J_m)$. In
the rest of the loop body, linear combinations of $(J_1, \ldots, J_m)$ are used in array
subscripts.

We call the iteration space defined by all possible values of $(I_1, \ldots, I_n)$ the
original iteration space. We call the iteration space defined by all possible values
of $(J_1, \ldots, J_m)$ the dependence test iteration space. We call such a mapping
from the original iteration space to the dependence test iteration space index
association, and the functions $f_k$ ($1 \le k \le m$) index-association functions.

In modern compilers, symbolic analysis is often applied before data dependence
analysis to compute $f_k$. Traditional dependence analysis techniques are
able to handle index-association functions $f_k$ that are linear functions. For
such cases, the function can be forward substituted into the subscript (to
replace $J_k$ ($k = 1, \ldots, m$)) and traditional techniques apply. However, if any $f_k$ is
a non-linear function (e.g., the tiny example in Figure 1), traditional techniques
often consider the subscript too complex and conservatively assume the worst-case
dependence. In the next section, we present details of our index-association based
dependence analysis, which tries to overcome this conservatism.



4 Dependence Analysis with Index Association 

The index-association based dependence analysis can be partitioned into three 
steps. First, the dependence test iteration space is constructed. Second, depen- 
dence analysis is conducted in the dependence test iteration space. Finally, the 
dependence relation in the original iteration space is determined by the result 
in the dependence test iteration space and the property of index-association 
functions. We elaborate the details below. 



4.1 Constructing Dependence Test Iteration Space 

The original iteration space can be viewed as an n-dimensional space. For
dimension $k$ ($1 \le k \le n$), we have a constraint $(L_k, U_k, S_k)$, where $L_k$ is the lower
bound, $U_k$ is the upper bound and $S_k$ is the step value.

To construct the dependence test iteration space, the compiler needs to analyze
the index-association functions. Currently, our compiler requires the
index-association function $f_k$ to have the following two properties:

— Each index-association function $f_k$ only takes one original loop index variable as its
argument. (Note that different $f_k$ can take the same loop index variable as
the argument.) For example, our compiler can handle index-association
functions like $J_1 = \mathrm{DIV}(I_1, 2)$, while it is not able to handle $J_1 = \mathrm{DIV}(I_1 + I_2, 2)$,
where both $I_1$ and $I_2$ are outer loop index variables.
Table 1. Iteration space mapping

| operator | expression | iteration space |
|---|---|---|
| plus | $f(I) + c$ | $(l + c,\ u + c,\ s)$ |
| minus | $f(I) - c$ | $(l - c,\ u - c,\ s)$ |
| mult | $c \cdot f(I)$ | $(c\,l,\ c\,u,\ c\,s)$ |
| division | $f(I) / c$ | $(l/c,\ u/c,\ (\lfloor s/c \rfloor, \lceil s/c \rceil))$ |
| modulo | $\mathrm{MOD}(f(I), c)$ | $(\mathrm{MOD}(l, c),\ \mathrm{MOD}(u, c),\ (s, \_others\_))$ |



It is possible to relax such a requirement for index-association functions, in
order to cover more cases. For certain cases, we can transform the
index-association function to make it conform to the requirement. For example,
for the function $f_k(I_1, I_2) = \mathrm{DIV}(I_1, 2) + 2 * I_2$, we can have $f_{k1}(I_1) = \mathrm{DIV}(I_1, 2)$,
$f_{k2}(I_2) = 2 * I_2$ and $f_k(I_1, I_2) = f_{k1}(I_1) + f_{k2}(I_2)$. If we propagate
$f_k(I_1, I_2)$ into the subscripts, index-association functions $f_{k1}$ and $f_{k2}$
will satisfy the requirement. For more general cases, however, it is much
more difficult to compute the dependence test iteration space. We leave such
an extension for our future work.

— The operators in $f_k$ must be plus, minus, multiplication, division or modulo.
The $f_k$ can be composed using the permitted operators recursively. For
example, our compiler is able to handle $J_1 = \mathrm{DIV}(2 I_1 + 3, 4)$ where $I_1$ is an
outer loop index.

Given the original iteration space, our compiler tries to construct the corresponding
dependence test iteration space for $J_k = f_k(I_p)$ ($1 \le k \le m$, $1 \le p \le n$)
with a form $(l_k, u_k, s_k)$, where $l_k$ is the lower bound, $u_k$ is the upper bound
and $s_k$ is the step. Supposing the loop $I_p$ has a lower bound $L_p$ and an upper
bound $U_p$, we have $l_k = f_k(L_p)$ and $u_k = f_k(U_p)$. The step $s_k$ represents the
difference between two $J_k$ values mapped from two adjacent $I_p$ values. Note that $s_k$
could be a sequence of values, including 0.

Suppose that in Figure 2 there exists a dependence from iteration
$(i_{11}, i_{21}, \ldots, i_{n1})$ to iteration $(i_{12}, i_{22}, \ldots, i_{n2})$. We say the corresponding dependence
distance is $(i_{12} - i_{11}, i_{22} - i_{21}, \ldots, i_{n2} - i_{n1})$ in the original iteration space.
Suppose that $(j_{11}, j_{21}, \ldots, j_{m1})$ are the corresponding $J$ values for $(i_{11}, i_{21}, \ldots, i_{n1})$,
and $(j_{12}, j_{22}, \ldots, j_{m2})$ for $(i_{12}, i_{22}, \ldots, i_{n2})$. The dependence distance in the
dependence test iteration space is $(j_{12} - j_{11}, j_{22} - j_{21}, \ldots, j_{m2} - j_{m1})$.

Table 1 illustrates our basic iteration space mapping from original iteration
space to dependence test iteration space, assuming the iteration space for $f(I)$
is $(l, u, s)$. For division, two different steps may result. For modulo, because of
the wrap-around nature of the function, some negative steps may appear, which
are represented by _others_ in the table. The dependence test iteration space
is computed by recursively computing the iteration space for sub-expressions
of $f_k(I_p)$ ($1 \le p \le n$), starting with $I_p$ and ending with $f_k(I_p)$.

Here, we want to specially mention the following two scenarios: 
(a)

    DO I = 1, N
      J = DIV(I, 2)
      A(J) = 5 * J
    END DO

(b)

    DO I = 1, N, 2
      J = DIV(I, 2)
      A(J) = A(J + 2)
    END DO

(c)

    DO I = 1, N, 2
      J = DIV(I, 2)
      A(J) = A(J + 1 + N/2)
    END DO

Fig. 3. Example 2



— Because a modulo operator may potentially generate many negative step values,
a condition is often generated considering the relation between
$u - l + 1$ and $c$ (in Table 1), in order to limit the number of negative steps.

— It is possible to have different $J_k$ associated with the same $I_p$, such as
$J_{k1} = f_{k1}(I_p)$ and $J_{k2} = f_{k2}(I_p)$. The coupling relation of $J_{k1}$ and $J_{k2}$ will be lost
in the dependence test iteration space, which will cause difficulties when the
dependence distance in the dependence test iteration space is mapped back
to the original iteration space. For such cases, if functions $f_{k1}$ and $f_{k2}$ are
both linear forms, we will perform forward substitution for these functions
and have a single $J_k = I_p$ as the index-association function. Otherwise, we
can still perform dependence analysis in the dependence test iteration space.
However, we are not able to compute the dependence distance in the original
iteration space precisely.

Figure 3 shows three examples. For Figure 3(a), the original iteration space 
is (1, N, 1). The dependence test iteration space is (0, N/2, s), where the step s 
is variant with a value of 0 or 1. For Figures 3(b) and (c), the original iteration 
space is (1, N, 2). The dependence test iteration space is (0, N/2, 1). 

4.2 Dependence Analysis in the Dependence Test Iteration Space 

After the dependence test iteration space is constructed, dependence analysis 
can be done in the dependence test iteration space, where traditional techniques, 
which target the linear form of the enclosing loop indices, are applied. 

However, note that the dependence test iteration space could have multiple step values in a certain dimension. For such cases, traditional techniques have to assume a step value which is the greatest common divisor of all possible non-zero step values. If the step value could be 0, we also assume a step value of 0 during dependence analysis. With such assumptions, we may get conservative results. In Section 5, we describe a technique which can potentially give us better results for such cases.
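A minimal sketch of the conservative step choice just described, assuming the set of possible steps has already been computed. The function name `conservative_step` and its return shape are illustrative, not taken from the paper.

```python
from functools import reduce
from math import gcd


def conservative_step(steps):
    """Return (g, may_be_zero) for a collection of possible step values.

    g is the greatest common divisor of all non-zero steps (0 if there are
    none); may_be_zero records whether a step of 0 must also be assumed.
    """
    nonzero = [abs(s) for s in steps if s != 0]
    g = reduce(gcd, nonzero, 0)
    may_be_zero = any(s == 0 for s in steps)
    return g, may_be_zero
```

For steps 3 and 4 this yields a conservative step of 1, and for steps 0 and 1 it additionally reports that a zero step must be assumed.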

Given a pair of references, there are three possible results from the depen- 
dence test in the dependence test iteration space. 

— If there exists no dependence in the dependence test iteration space, then 
there will be no dependence in the original iteration space. 

— If there exists a dependence with a distance d in the dependence test iteration space, then we compute the dependence distance in the original space based on d and the property of index-association functions. This will be further explored in the next subsection.

— If there exists a dependence with an unknown distance in the dependence test iteration space, we simply regard that there exists an unknown distance dependence in the original iteration space.

Table 2. Dependence distance mapping

operator   org expr       org dist   new expr   new dist
plus       f(I) + c       d          f(I)       d
minus      f(I) - c       d          f(I)       d
mult       c * f(I)       d          f(I)       d/c if MOD(d, c) = 0, no dependence otherwise
division   f(I) / c       d          f(I)       (dc - c + 1, ..., dc + c - 1)
modulo     MOD(f(I), c)   d          f(I)       d

In Figure 3(a), because the step can have a value of 0, the dependence distance from A(J) to itself could be 0 in the dependence test iteration space. In Figures 3(b) and (c), however, there exists no dependence from A(J) to itself in the dependence test iteration space. In Figure 3(b), there exists a dependence from A(J + 2) to A(J) with distance 2 in the dependence test iteration space. In Figure 3(c), because the dependence test iteration space for J is (0, N/2, 1), we can easily see that there exists no dependence between A(J) and A(J + 1 + N/2) in the dependence test iteration space.

4.3 Computing Dependence Distance in Original Iteration Space 

Given a dependence distance in the dependence test iteration space, we need to analyze the property of index-association functions in order to get the proper dependence distance in the original iteration space. Table 2 illustrates how we compute the dependence distance in the original iteration space based on index-association functions, where "org expr" and "org dist" represent the original expression and its associated distance, and "new expr" and "new dist" represent the sub-expression in the original expression and its associated distance. The dependence distance in the original iteration space is computed by recursively computing the distance for the sub-expression of $J_k = f_k(I_p)$ $(1 \le k \le m,\ 1 \le p \le n)$, starting with $f_k(I_p)$ and ending with $I_p$.

In Table 2, we want to particularly mention the dependence distance calculation of $f(I)$ for multiplication and division. Let us assume that iterations $i_1$ and $i_2$ have a dependence. For multiplication, we have $c f(i_1) - c f(i_2) = c (f(i_1) - f(i_2)) = d$. We can derive $f(i_1) - f(i_2) = d/c$ if $\mathrm{MOD}(d, c) = 0$. Otherwise, there will be no dependence between $f(i_1)$ and $f(i_2)$. For division, we have $f(i_1)/c - f(i_2)/c = d$. We want to find the range of $f(i_1) - f(i_2)$. Through mathematical manipulation, we can find $dc - c + 1 \le f(i_1) - f(i_2) \le dc + c - 1$ for general cases, as illustrated in Table 2. For certain cases, however, we can get a more precise result. For example, if $\mathrm{MOD}(f(i), c)$ is always equal to 0, the division is exact and the distance for $f(I)$ would be solely $dc$.
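The rows of Table 2 can be read as a single mapping step from the distance of an expression to the candidate distances of its sub-expression $f(I)$. The sketch below is a hypothetical rendering of that step in Python: the function name `map_distance` and the operator labels are assumptions, and it assumes a positive constant `c`.

```python
def map_distance(op, d, c=None):
    """Candidate distances for f(I), given distance d for the outer expression.

    op is one of "plus", "minus", "mult", "division", "modulo" (labels follow
    Table 2); c is the positive constant operand where one is needed.
    An empty list means no dependence is possible.
    """
    if op in ("plus", "minus", "modulo"):
        return [d]
    if op == "mult":
        # c * f(i1) - c * f(i2) = d only has a solution when c divides d.
        return [d // c] if d % c == 0 else []
    if op == "division":
        # General case: dc - c + 1 <= f(i1) - f(i2) <= dc + c - 1.
        return list(range(d * c - c + 1, d * c + c))
    raise ValueError(f"unknown operator: {op}")
```

For DIV(I, 2), a distance of 0 maps to the candidates -1, 0 and 1, and a distance of 2 maps to 3, 4 and 5, which matches the values quoted for Figures 3(a) and 3(b).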

In Figure 3(a), there exists a dependence from A(J) to itself with a distance 0 in the dependence test iteration space. Because of index-association function DIV(I, 2), it is easy to see that the corresponding distance in the original iteration space is 0 or 1. (The -1 is an illegal distance and is ignored.)

In Figure 3(b), there exists a dependence from A(J + 2) to A(J) with a distance 2 in the dependence test iteration space. Because of index-association function DIV(I, 2), the corresponding distance in the original iteration space would be 3 or 4 or 5.

Input: A perfect loop nest conforming to Figure 2.
Output: Dependence relations between references inside the loop nest.
Procedure:
  Analyze f_k (k = 1, ..., m) and try to construct the dependence test iteration space.
  if (the dependence test iteration space cannot be constructed successfully) then
    Assume a worst-case dependence test iteration space.
  end if
  for (each pair of references r_1 and r_2)
    if (there exists no dependence in the dependence test iteration space) then
      There exists no dependence in the original space.
    else if (the dependence distance is d in the dependence test iteration space) then
      Compute the distance in the original space based on d and f_k.
    else
      There exists a dependence in the original space with unknown distance.
    end if
  end for
End procedure

Fig. 4. Top algorithm for index-association based dependence analysis

    DO I = 1, 100, 3
      J = 5*I/4
      A(J + 9) = A(J) + 1
    END DO

Fig. 5. Example 3



4.4 Overall Structure 

Figure 4 shows our overall algorithm for index-association based dependence analysis. The first step is to construct the dependence test iteration space. If the dependence test iteration space cannot be constructed due to complex index-association functions, we have to assume a worst-case dependence test iteration space, i.e., for each $J_k$ with iteration space $(l_k, u_k, s_k)$, we have $l_k = -\infty$, $u_k = +\infty$ and $s_k$ could be any integer value.

As stated previously, if there exist multiple steps for a certain dimension in the dependence test iteration space, dependence analysis must assume a conservative step, often the greatest common divisor of all possible steps, in order to compute a correct dependence relation. The resultant dependence relation, however, might be conservative. For example, for the loop in Figure 5, the steps for J values can be either 3 or 4. So our index-association based approach has to take the conservative step of 1 in the dependence test iteration space. This will assume array references A(J + 9) and A(J) have cross-iteration dependences. Hence, the original loop I cannot be parallelized. In the next section, we present a technique to handle certain index-association functions with division, which can be equivalently transformed to a set of linear expressions. The latter can be used to compute the dependence relation, including dependence distances, more precisely than with traditional techniques.



5 Accurate Dependence Analysis with Division 

The basic idea here is to replace the non-linear expression with a set of linear expressions and then use these linear expressions during dependence testing with traditional techniques. Specifically, we want to find a set of linear expressions which are equivalent to $J = f(I)$, where the index $I$ has the iteration space $(L, U, S)$ and the function $f$ contains operations such as plus, minus, multiplication and division.

Without loss of generality, we assume $U \ge L$ and $S > 0$. Let $t$ be the loop trip count for loop $I$, and we have $t = \lfloor (U - L + S)/S \rfloor$. Let $i_1, i_2, \ldots, i_t$ represent the $t$ loop index $I$ values, from the smallest one to the largest one. Let $j_p = f(i_p)$, $1 \le p \le t$, be the corresponding $J$ index values.

First, let us take the loop in Figure 5 as an example. We want to express $J = 5 \cdot I / 4$ as a set of linear expressions. For the $I$ value sequence (1, 4, 7, 10, 13, 16, 19, 22, ..., 97, 100), the corresponding $J$ value sequence is (1, 5, 8, 12, 16, 20, 23, 27, ..., 121, 125). Clearly, the $J$ value sequence is not a linear sequence because the difference between adjacent values varies. However, note that the difference between every $p$th and $(p + 4)$th $J$ values $(1 \le p \le t - 4)$ is a constant of 15. Therefore, the original $J$ value sequence can be represented as 4 linear sequences, each with a step of 15, with initial values 1, 5, 8 and 12 respectively.
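The observation above is easy to reproduce by enumeration. The following Python sketch generates the J sequence for the loop in Figure 5 and splits it into its four interleaved linear subsequences. The helper name `j_sequence` is an assumption, and `//` stands in for Fortran integer division, which agrees with it for the non-negative values used here.

```python
def j_sequence(lower, upper, step, c, d):
    """J = c * I / d (integer division) for I = lower, lower + step, ..., upper."""
    return [c * i // d for i in range(lower, upper + 1, step)]


js = j_sequence(1, 100, 3, 5, 4)

# Differences between adjacent J values: not constant, so J is non-linear in I.
adjacent = sorted({b - a for a, b in zip(js, js[1:])})

# Differences between every p-th and (p + 4)-th value: a single constant.
stride4 = {js[p + 4] - js[p] for p in range(len(js) - 4)}

# The four interleaved linear sequences, one per initial value.
subsequences = [js[k::4] for k in range(4)]
print(adjacent, stride4, [s[0] for s in subsequences])
```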

To generalize the above observation, for a sequence of $J$ values $j_p$ $(1 \le p \le t)$, we want to find $\tau$, the number of linear expressions needed to represent $j_p$, and $\sigma$, the step value for each individual linear expression.

The differences between adjacent values in the $J$ value sequence can be expressed as

$$js_1 = j_2 - j_1 = f(i_2) - f(i_1),$$
$$js_2 = j_3 - j_2 = f(i_3) - f(i_2),$$
$$\ldots$$
$$js_{t-1} = j_t - j_{t-1} = f(i_t) - f(i_{t-1}).$$

With the semantics of $\tau$, $js_p = js_{p+\tau}$ holds for all $1 \le p$, $p + \tau \le t - 1$. This is equivalent to

$$f(i_{p+1}) - f(i_p) = f(i_{p+\tau+1}) - f(i_{p+\tau}), \quad \forall\, 1 \le p,\ p + \tau \le t - 1. \qquad (1)$$

Different index-association functions $f$ may require different complexities to compute $\tau$. Conservative methods can also be applied if the compiler is not able to do sophisticated analysis and manipulation. The compiler has to make the worst assumption if it cannot find a compile-time known constant $\tau$, e.g., using the dependence analysis technique in Section 4.

Now suppose $\tau$ is available. For each linear expression, we can easily compute the corresponding step as

$$\sigma = f(i_{p+\tau}) - f(i_p), \quad 1 \le p,\ p + \tau \le t - 1. \qquad (2)$$

In this paper, we do not try to construct the trip count for different linear expressions and rather conservatively assume a trip count equal to that of the linear expression with the initial value of $f(L)$, which also has the maximum trip count over all $\tau$ linear expressions.

With $\tau$ and $\sigma$ available, $J = f(I)$ can be expressed as

$$J = \sigma \cdot I' + r' \qquad (3)$$

where $I'$ is an integer variable whose iteration space is $(0, \lceil t/\tau \rceil - 1, 1)$, and $r'$ is a set of $\tau$ discrete numbers $\{ f(i_p) \mid 1 \le p \le \tau \}$.

Since the set of linear expressions is equivalent to the original non-linear expression, whether a dependence exists with the original non-linear expression can be determined by whether a dependence exists with the transformed set of linear expressions. For any dependence distance value $d$ (regarding loop index $I'$) computed with the transformed linear expressions, the dependence distance in the original $I$ iteration space can be computed based on $d$ and the difference between the corresponding $r'$ values. For example, suppose that we have a dependence between $j_1 = f(i_1) = \sigma \cdot i'_1 + r'_1$ and $j_2 = f(i_2) = \sigma \cdot i'_2 + r'_2$, with a dependence distance $i'_2 - i'_1 = d$. We have $f(i_2) - f(i_1) = \sigma \cdot d + r'_2 - r'_1$, from which we can further estimate $i_2 - i_1$, maybe conservatively.

As an example, we now show how we compute $\tau$ and $\sigma$ for the expression $J = f(I) = C \cdot I / D$.

If $C \cdot \tau \cdot S$ is divisible by $D$, the equation $f(i_{p+1}) - f(i_p) = f(i_{p+\tau+1}) - f(i_{p+\tau})$ will hold. To make $C \cdot \tau \cdot S$ divisible by $D$, we can let $\tau = D / \mathrm{GCD}(C \cdot S, D)$, where $\mathrm{GCD}(C \cdot S, D)$ represents the greatest common divisor of $C \cdot S$ and $D$.
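As a hedged illustration of this choice of $\tau$, the sketch below computes $\tau = D / \mathrm{GCD}(C \cdot S, D)$ and then takes $\sigma = C \cdot \tau \cdot S / D$. The closed form for $\sigma$ is an assumption derived here from equation (2) under the divisibility condition and is not stated explicitly in the text; the function names are hypothetical. A brute-force check confirms the result on a concrete loop with non-negative index values.

```python
from math import gcd


def period_and_step(c, d, s):
    """Return (tau, sigma) for J = c * I / d with loop step s (all positive)."""
    tau = d // gcd(c * s, d)
    # With c * tau * s divisible by d, advancing tau iterations adds exactly
    # c * tau * s / d to J (assumed closed form for sigma, see lead-in).
    sigma = c * tau * s // d
    return tau, sigma


def holds_for_loop(lower, upper, s, c, d):
    """Check by enumeration that J advances by sigma every tau iterations."""
    tau, sigma = period_and_step(c, d, s)
    js = [c * i // d for i in range(lower, upper + 1, s)]
    return all(js[p + tau] - js[p] == sigma for p in range(len(js) - tau))
```

For the loop in Figure 5 (C = 5, D = 4, S = 3) this gives $\tau = 4$ and $\sigma = 15$, in agreement with the sequences worked out earlier.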

Now, we show how our technique can determine whether a dependence exists between A(J + 9) and A(J) in Example 3 (Figure 5), i.e., whether there exist any instances of J, say $j_1$ and $j_2$, such that

$$j_1 + 9 = j_2 \qquad (4)$$

has a solution.

With our technique, the non-linear expression $J = 5 \cdot I / 4$, where loop $I$'s iteration space is (1, 100, 3), can be represented equivalently by

$$J = 15 \cdot I' + r', \quad r' \in \{1, 5, 8, 12\}, \quad I' \text{ has iteration space } (0, 8, 1). \qquad (5)$$

With the linear expression (5), equation (4) is equivalent to

$$15 \cdot i'_1 + r'_1 + 9 = 15 \cdot i'_2 + r'_2, \qquad (6)$$

where $i'_1$ and $r'_1$ are used for $j_1$, and $i'_2$ and $r'_2$ for $j_2$.

To consider whether equation (6) has a solution or not, we have

$$15 \cdot (i'_1 - i'_2) = (r'_2 - r'_1) - 9$$
$$= (\{1, 5, 8, 12\} - \{1, 5, 8, 12\}) - 9$$
$$= \{-11, -7, -4, -3, 0, 3, 4, 7, 11\} - 9$$
$$= \{-20, -16, -13, -12, -9, -6, -5, -2, 2\}$$

None of the possible values on the right-hand side is divisible by 15, so there exists no solution for (4) and no dependence between A(J + 9) and A(J). Therefore, the loop I in Figure 5 can be parallelized successfully.
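The divisibility argument can be sketched in a few lines of Python. The names `rhs_values`, `may_depend` and `brute_force` are hypothetical. The divisibility test ignores the bounds on $I'$, so a True result is only a conservative "maybe", whereas the brute-force enumeration is exact for the concrete loop.

```python
def rhs_values(residues, offset):
    """All values of (r2 - r1) - offset over pairs of initial values."""
    return sorted({r2 - r1 - offset for r1 in residues for r2 in residues})


def may_depend(sigma, residues, offset):
    """True if sigma * (i1' - i2') = (r2 - r1) - offset can hold for integers."""
    return any(v % sigma == 0 for v in rhs_values(residues, offset))


def brute_force(lower, upper, step, c, d, offset):
    """Exact check: do two iterations produce J values that differ by offset?"""
    js = {c * i // d for i in range(lower, upper + 1, step)}
    return any(j + offset in js for j in js)


print(may_depend(15, [1, 5, 8, 12], 9), brute_force(1, 100, 3, 5, 4, 9))
```

With an offset of 4 instead of 9 both checks report a possible dependence, since 1 + 4 = 5 pairs the first two J values.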

Our index-association based dependence distance can help both general loop 
transformations and automatic parallelization because it tries to provide a more 
accurate dependence test result. In the next section, we particularly illustrate 
how our technique helps automatic parallelization, i.e., whether a certain level 
of loop is a DOALL loop or not, and under what condition it is a DOALL loop. 
We do not explore how our technique helps general loop transformations in this 
paper. 

6 Automatic Parallelization with Index Association 

For automatic parallelization, our index-association based dependence analysis can help determine whether a loop, which conforms to our program model in Figure 2 with some non-linear index-association functions $f_k$, is a DOALL loop or not. For non-DOALL loops, previous work such as [7] generates run-time conditionals under which the loop will be a DOALL loop, to guard the parallelized code. Our compiler also has the ability to generate proper conditions under which a certain loop is a DOALL loop, such as the example in Figure 1. From Table 1, if the index-association function contains the operators division and modulo, multiple step values may be generated in the dependence test iteration space, which makes dependence analysis conservative. To get more precise dependence analysis results, conditionals are often generated so that we can have fewer step values, often just one, in the dependence test iteration space for one index-association function. By combining index-association based dependence analysis and such two-version code parallelization, our compiler is able to parallelize some otherwise hard-to-parallelize loops. For example, our compiler is able to determine that the loops in Figures 3(a) and (b) are not DOALL loops and that the loop in Figure 3(c) is a DOALL loop, based on the dependence analysis in Section 4. We will now work through a more complex example to show how we combine index-association based dependence analysis and two-version code parallelization to successfully parallelize one outer loop.

Figure 6(a) shows the original code, where $C_2$, $C_3$ and $S_2$ are all compile-time known constants and $C_1$ is a loop nest invariant. We also suppose that all right-hand sides of assignments $A(J + k) = \ldots$ $(0 \le k \le C_3)$ do not contain references to array A. The original iteration space for loop $I_2$ is $(I_1 C_1, (I_1 + 1) C_1 - 1, S_2)$. With the property of index-association function DIV, we can derive the dependence test iteration space for J (corresponding to the original loop $I_2$) as $(\lfloor I_1 C_1 / C_2 \rfloor, \lfloor ((I_1 + 1) C_1 - 1) / C_2 \rfloor, (\lfloor S_2 / C_2 \rfloor, \lceil S_2 / C_2 \rceil))$, where the step is variant with either $\lfloor S_2 / C_2 \rfloor$ or $\lceil S_2 / C_2 \rceil$. Therefore, if the condition $C_3 < \lfloor S_2 / C_2 \rfloor$ holds, the loop $I_2$ is parallelizable.

(a)
    DO I1 = L1, U1
      DO I2 = I1*C1, (I1 + 1)*C1 - 1, S2
        J = DIV(I2, C2)
        A(J) = ...
        A(J + 1) = ...
        ...
        A(J + C3) = ...
      END DO
    END DO

(b)
    IF (MOD(C1, S2) = 0) THEN
      DO I3 = L1*C1, (U1 + 1)*C1 - 1, S2
        J = DIV(I3, C2)
        A(J) = ...
        A(J + 1) = ...
        ...
        A(J + C3) = ...
      END DO
    ELSE
      DO I1 = L1, U1
        DO I2 = I1*C1, (I1 + 1)*C1 - 1, S2
          J = DIV(I2, C2)
          A(J) = ...
          A(J + 1) = ...
          ...
          A(J + C3) = ...
        END DO
      END DO
    END IF

(c)
    IF ((MOD(C1, S2) = 0) .AND. (C3 < ⌊S2/C2⌋)) THEN
      ! The following loop is DOALL
      DO I3 = L1*C1, (U1 + 1)*C1 - 1, S2
        J = DIV(I3, C2)
        A(J) = ...
        A(J + 1) = ...
        ...
        A(J + C3) = ...
      END DO
    ELSE
      DO I1 = L1, U1
        DO I2 = I1*C1, (I1 + 1)*C1 - 1, S2
          J = DIV(I2, C2)
          A(J) = ...
          A(J + 1) = ...
          ...
          A(J + C3) = ...
        END DO
      END DO
    END IF

Fig. 6. Example 4



Parallelizing the outer loop $I_1$ needs more analysis. Here, by analyzing the loop bounds and steps, our compiler is able to determine that if the condition $\mathrm{MOD}(C_1, S_2) = 0$ holds, i.e., $C_1$ is divisible by $S_2$, the loops $I_1$ and $I_2$ can actually be collapsed into one loop. Figure 6(b) shows the code after loop collapsing. The new loop $I_3$ in Figure 6(b) can be further parallelized if the condition $C_3 < \lfloor S_2 / C_2 \rfloor$ holds, as analyzed in the previous paragraph. Figure 6(c) shows the final code, where the collapsed loop $I_3$ is parallelized under the condition $\mathrm{MOD}(C_1, S_2) = 0$ and $C_3 < \lfloor S_2 / C_2 \rfloor$. Our compiler is thus able to successfully parallelize the outer loop $I_1$ in Figure 6(a).
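The run-time guard for Example 4 can be sanity-checked with a small simulation. This Python sketch is an illustration under stated assumptions: the function names are hypothetical, index values are assumed non-negative so that `//` matches DIV, and only the writes A(J), ..., A(J + C3) are modelled, since the right-hand sides are assumed not to reference A.

```python
def collapsed_loop_is_doall(c1, c2, c3, s2):
    """The guard used for the parallel version: MOD(C1, S2) = 0 and C3 < floor(S2 / C2)."""
    return c1 % s2 == 0 and c3 < s2 // c2


def writes_overlap(l1, u1, c1, c2, c3, s2):
    """Simulate the collapsed loop and report whether two iterations write the same element."""
    seen = set()
    for i3 in range(l1 * c1, (u1 + 1) * c1, s2):
        j = i3 // c2
        cells = set(range(j, j + c3 + 1))  # A(J), A(J + 1), ..., A(J + C3)
        if seen & cells:
            return True
        seen |= cells
    return False
```

With C1 = 12, C2 = 2, S2 = 6 and C3 = 2 the guard holds and no two iterations touch the same element. Raising C3 to 3 violates the guard, and iterations then overlap at A(3). The guard is a sufficient condition; the simulation does not show that it is necessary.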

7 Experimental Results 

We have implemented our index-association based dependence analysis technique in the Sun ONE Studio [tm] 8 compiler collection [11], which is also used in our experiments. (We have not yet implemented the technique presented in Section 5. We plan to evaluate and experiment with it in future releases.) Our compiler already implements several dependence analysis techniques for subscripts which are linear forms of enclosing loop indices, such as the GCD test, the separability test, the Banerjee test, etc. Our compiler also implements some sophisticated techniques for array/scalar privatization analysis, symbolic analysis, and parallelization-oriented loop transformations including loop distribution/fusion, loop interchange, wavefront transformation [12], etc. Therefore, our compiler can already parallelize many loops in practice. With our new index-association based dependence analysis, we extend our compiler's ability to parallelize more loop nests which otherwise cannot be parallelized.

We choose two programs from the well-known SPEC CPU2000 suite [10], swim and lucas, which benefit from the technique developed in this paper. In the second quarter of 2003, we submitted automatic parallelization results for SPEC CPU2000 on a Sun Blade [tm] 2000 workstation with two 1200 MHz UltraSPARC III Cu [tm] processors to SPEC [10], which is the first such submission for SPEC CPU2000 on automatic parallelization. Compared to the results on a Sun Blade [tm] 2000 with just one 1200 MHz UltraSPARC III Cu [tm] processor [10], we achieve a speedup of 1.60 for swim and a speedup of 1.14 for lucas. To evaluate the effectiveness of our technique on more than two processors, we further experimented on a Sun Fire [tm] 6800 server with 24 1200 MHz UltraSPARC III Cu [tm] processors and the Solaris [tm] 9 operating system. For each program, we measure the best serial performance as well as the parallel performance with various numbers of processors, up to 23 processors. We do not report the result for 24 processors because, in general, due to system activity, it may not bring any speedup over the result for 23 processors.

7.1 swim 

The benchmark swim is a weather prediction program written in Fortran. It is a memory-bandwidth-limited program, and the tiling technique in [9], which has been implemented in our compiler, can improve temporal cache locality for data, thus alleviating the bandwidth problem. For example, on one processor of our target machine, the code runs in 305 seconds without tiling and in 134 seconds with tiling. Tiling thus gives a speedup of 2.28 for a single-processor run because of the substantially improved cache locality. After tiling, however, some IF statements and MOD operators are introduced into the loop body because of aggressive loop fusion and circular loop skewing [9], which makes it impossible to reuse the dependence information derived before tiling. To parallelize such loop nests, our dependence analysis phase correctly analyzes the effect of the IF statements and MOD operators, and generates proper conditions to parallelize all four of the most important loops.

Figure 7(a) shows the speedup for swim with different numbers of processors with and without our index-association based dependence analysis, represented by "With IA-DEP" and "Without IA-DEP" respectively. Without index-association based dependence analysis, the tiled code cannot be parallelized by our compiler. However, our compiler is still able to parallelize all four important loop nests if tiling is not applied. We regard the result of such parallelization as the "Without IA-DEP" parallelization. On two processors, the actual "Without IA-DEP" parallelization performance is worse than the performance of the tiled code on one processor, so we use the result for the tiled code on one processor as the "Without IA-DEP" result for two processors. From Figure 7(a), it is clear that our index-association based dependence analysis can greatly improve parallel performance for swim.

Figure 7(a) also shows that parallelization with IA-DEP scales better than without IA-DEP. This is because swim is a memory-bandwidth-limited benchmark and tiling enables better scaling with most data accessed in the L2 cache, which is local to each processor, instead of in main memory. This is also true with large data sizes in the OpenMP version of swim. In March 2003, Sun submitted performance results for 8/16/24 threads for SPEC OMPM2001 on a Sun Fire [tm] 6800 server [10]. The results show that without tiling, using OpenMP parallelization directives, the speedup from 8 threads to 16 threads is 1.33. With tiling, and with OpenMP directive parallelization turned off, however, the speedup is 1.44. The performance with tiling is also significantly better than without tiling, e.g., SPEC scores of 14199 vs. 8351 for 16 threads.

Index-Association Based Dependence Analysis 239

[Figure 7: speedup plots, panels (a) swim and (b) lucas]

Fig. 7. Speedup on different number of processors for swim and lucas

7.2 lucas 

The benchmark lucas tests primality of Mersenne numbers. There are mainly two classes of loop nests in the program. One class is similar to our Example 4 in Figure 6, and the other contains indexed array references, i.e., array references appear in the subscripts. Currently, our compiler is not able to parallelize loops in the second class. However, with index-association based dependence analysis, it is able to parallelize all important loops in the first class. Figure 7(b) shows the speedup for lucas on different numbers of processors. Note that no speedup is achieved for multiple-processor runs without index-association based dependence analysis, since none of the important loops is parallelized.

8 Conclusion 

In this paper, we have presented a new dependence analysis technique called 
index-association based dependence analysis. Our technique targets a special 
class of loop nests and uses a decoupled approach for dependence analysis of 
complex array subscripts. We also present a technique to transform a non-linear 
expression to a set of linear expressions and the latter can be used in dependence
test with traditional techniques. Experiments show that our technique is able to
help parallelize some otherwise hard-to-parallelize loop nests.

Abstract. Hierarchically-blocked non-linear storage layouts, such as the
Morton ordering, have been proposed as a compromise between row- 
major and column-major for two-dimensional arrays. Morton layout of- 
fers some spatial locality whether traversed row-wise or column-wise. 
The goal of this paper is to make this an attractive compromise, offer- 
ing close to the performance of row-major traversal of row-major layout, 
while avoiding the pathological behaviour of column-major traversal. We 
explore how spatial locality of Morton layout depends on the alignment 
of the array’s base address, and how unrolling has to be aligned to reduce 
address calculation overhead. We conclude with extensive experimental 
results using five common processors and a small suite of benchmark 
kernels. 



1 Introduction 

Programming languages that offer support for multi-dimensional arrays gener- 
ally use one of two linear mappings to translate from multi-dimensional array 
indices to locations in the machine’s linear address space: row-major or column- 
major. Traversing an array in the same order as it is laid out in memory leads 
to excellent spatial locality; however, traversing a row-major array in column- 
major order or vice-versa, can lead to an order-of-magnitude worse performance. 
Morton order is a hierarchical, non-linear mapping from array indices to mem- 
ory locations which has been proposed by several authors as a possible means 
of overcoming some of the performance problems associated with lexicographic 
layouts [2, 4, 10, 12]. The key advantages of Morton layout are that the spatial 
locality of memory references when iterating over a Morton order array is not 
biased towards either the row-major or the column major traversal order and 
that the resulting performance tends to be much smoother across problem-sizes 
than with lexicographic arrays [2]. Storage layout transformations, such as using 
Morton layout, are always valid. These techniques complement other methods 
for improving locality of reference in scientific codes, such as tiling, which rely 
on accurate dependence and aliasing information to determine their validity for 
a particular loop nest. 



L. Rauchwerger (Ed.): LCPC 2003, LNCS 2958, pp. 241-257, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004

Previous Work. In our investigation of Morton layout, we have thus far confined our attention to non-tiled codes. We have carried out an exhaustive investigation of the effect of poor memory layout and the feasibility of using Morton layout as a compromise between row-major and column-major [8]. Our main conclusions thus far were:

— It is crucial to consider a full range of problem sizes. 

The fact that lexicographic layouts can suffer from severe interference prob- 
lems for certain problem sizes means that it is important to consider a full 
range of randomly generated problem sizes when evaluating the effectiveness 
of Morton layout [8] . 

— Morton address calculation: table lookup is a simple and effective solution. 
Production compilers currently do not support non-linear address calcula- 
tions for multi-dimensional arrays. Wise et al. [12] investigate the effective- 
ness of the “dilated arithmetic” approach for performing the address calcu- 
lation. We have found that a simple table lookup scheme works remarkably 
well [8]. 

— Effectiveness of Morton layout. 

We found that Morton layout can be an attractive compromise on machines 
with large L2 caches, but the overall performance has thus far still been dis- 
appointing. However, we also observed that only a relatively small improve- 
ment in the performance of codes using Morton layout would be sufficient to 
make Morton storage layout an attractive compromise between row-major 
and column-major. 



Contributions of this Paper. We make two contributions which can improve 
the effectiveness of the basic Morton scheme and which are both always valid 
transformations . 

— Aligning the Base Address of Morton Arrays (Section 2). 

A feature of lexicographic layouts is that the exact size of an array can influ- 
ence the pattern of cache interference misses, resulting in severe performance 
degradation for some datasizes. This can be overcome by carefully padding 
the size of lexicographic arrays. In this paper, we show that for Morton 
layout arrays, the alignment of the base address of the array can have a sig- 
nificant impact on spatial locality when traversing the array. We show that 
aligning the base address of Morton arrays to page boundaries can result in 
significant performance improvements. 

— Unrolling Loops over Morton Arrays (Section 3).

Most compilers unroll regular loops over lexicographic arrays. Unfortunately, current compilers cannot unroll loops over Morton arrays effectively due to the nature of address calculations: unlike with lexicographic layouts, there is no general straightforward (linear) way of expressing the relationship between array locations A[i][j] and A[i][j+1] which a compiler could exploit. We show that, provided loops are unrolled in a particular way, it is possible to express these relationships by simple integer increments, and we demonstrate that using this technique can significantly improve the performance of Morton layout.



1.1 Background: Morton Storage Layout 

Lexicographic Array Storage. For an $M \times N$ two-dimensional array $A$, a mapping $S(i, j)$ is needed, which gives the memory offset at which array element $A_{i,j}$ will be stored. Conventional solutions are row-major (for example in C and Pascal) and column-major (as used by Fortran) mappings expressed by

$$S_{rm}^{(M,N)}(i, j) = N \times i + j \quad \text{and} \quad S_{cm}^{(M,N)}(i, j) = i + M \times j$$

respectively. We refer to row-major and column-major as lexicographic, i.e. elements are arranged by the sort order of the two indices (another term is "canonical").

Blocked Array Storage. Traversing a row-major array in column-major order, or vice-versa, leads to poor performance due to poor spatial locality. An attractive strategy is to choose a storage layout which offers a compromise between row-major and column-major. For example, we could break the $M \times N$ array into small, $P \times Q$ row-major subarrays, arranged as an $M/P \times N/Q$ row-major array. We define the blocked row-major mapping function (this is the 4D layout discussed in [2]) as:

$$S_{brm}^{(M,N)}(i, j) = (P \times Q) \times S_{rm}^{(M/P,\,N/Q)}(i/P,\ j/Q) + S_{rm}^{(P,Q)}(i \% P,\ j \% Q)$$



[Figure 1: diagram of a blocked array. Annotations: row-major traversal, one in four accesses hits a new cache line; column-major traversal, one in four accesses hits a new cache line.]

Fig. 1. Blocked row-major ("4D") layout with block-size P = Q = 4. The diagram illustrates that with 16-word cache lines, illustrated by different shadings, the cache hit rate is 75% whether the array is traversed in row-major or column-major order

[Figure 2: 8 x 8 grid of Morton offsets, with row and column indices 0 to 7]

Fig. 2. Morton storage layout for an 8 x 8 array. Location of element A[5, 4] is calculated by interleaving "dilated" representations of 5 and 4 bitwise: $D_0(5) = 100010_2$, $D_1(4) = 010000_2$. $S_{mz}(5, 4) = D_0(5) \mid D_1(4) = 110010_2 = 50_{10}$


244 Jeyarajan Thiyagalingam et al. 



Table 1. Theoretical hit rates for row-major traversal of a large array of double words on different levels of memory hierarchy. Possible conflict misses or additional hits due to temporal locality are ignored. This illustrates the compromise nature of Morton layout

                    Row-major layout   Morton layout   Column-major layout
32B cache line      75%                50%             0%
128B cache line     93.75%             75%             0%
8kB page            99.9%              96.875%         0%



For example, consider 16-word cache blocks and P = Q = 4, as illustrated in 
Figure 1. Each block holds a P x Q = 16-word subarray. In row-major traversal, 
the four iterations (0,0), (0,1), (0,2) and (0,3) access locations on the same 
block. The remaining 12 locations on this block are not accessed until later 
iterations of the outer loop. Thus, for a large array, the expected cache hit rate is 
75%, since each block has to be loaded four times to satisfy 16 accesses. The same 
rate results with column-major traversal. Most systems have a deep memory 
hierarchy, with block size, capacity and access time increasing geometrically with 
depth [1]. Blocking should therefore be applied for each level. Note, however, that 
this becomes very awkward if larger blocksizes are not whole multiples of the 
next smaller blocksize. 
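The 75% figure can be checked with a small simulation. The sketch below is an illustration under simplifying assumptions, not the authors' method: it counts an access as a hit only when it falls on the same cache line as the previous access in the same row (or column) sweep, which models a large array where a line is evicted before the outer loop returns to it. The helper names are hypothetical.

```python
def s_rm(i, j, N):
    """Row-major offset in an N x N array."""
    return N * i + j


def s_brm(i, j, N, P, Q):
    """Blocked row-major offset in an N x N array with P x Q blocks."""
    return (P * Q) * ((N // Q) * (i // P) + (j // Q)) + Q * (i % P) + (j % Q)


def hit_rate(offset, N, line_words, order):
    """Spatial-locality-only hit rate of a full traversal.

    order == "row": i is the outer loop; otherwise j is the outer loop.
    """
    hits = 0
    for a in range(N):
        last_line = None  # assume nothing survives from the previous sweep
        for b in range(N):
            i, j = (a, b) if order == "row" else (b, a)
            line = offset(i, j) // line_words
            if line == last_line:
                hits += 1
            last_line = line
    return hits / (N * N)


N = 64
blocked = lambda i, j: s_brm(i, j, N, 4, 4)
row_major = lambda i, j: s_rm(i, j, N)

blocked_rates = (hit_rate(blocked, N, 16, "row"), hit_rate(blocked, N, 16, "col"))
row_major_rates = (hit_rate(row_major, N, 16, "row"), hit_rate(row_major, N, 16, "col"))
```

Under these assumptions the blocked layout gives the same hit rate in both traversal orders, whereas the row-major layout is very good in one order and gets no spatial reuse in the other.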



Bit-Interleaving and Morton Layout. Assume for the time being that, for
an M × N array, $M = 2^m$, $N = 2^n$. Write the array indices i and j as

$B(i) = i_{m-1} i_{m-2} \ldots i_1 i_0$ and $B(j) = j_{n-1} j_{n-2} \ldots j_1 j_0$

respectively. From this point, we restrict our analysis to square arrays (where
M = N). Now the lexicographic mappings can be expressed as bit-concatenation
(written “||”):

$S_{rm}^{(N,N)}(i,j) = N \times i + j = B(i) \| B(j) = i_{n-1} i_{n-2} \ldots i_1 i_0 j_{n-1} j_{n-2} \ldots j_1 j_0$
$S_{cm}^{(N,N)}(i,j) = i + M \times j = B(j) \| B(i) = j_{n-1} j_{n-2} \ldots j_1 j_0 i_{n-1} i_{n-2} \ldots i_1 i_0$

If $P = 2^p$ and $Q = 2^q$, the blocked row-major mapping is

$S_{brm}^{(N,N)}(i,j) = (P \times Q) \times S_{rm}^{(N/P,N/Q)}(i/P, j/Q) + S_{rm}^{(P,Q)}(i \% P, j \% Q)$

$= B(i)_{(n-1) \ldots p} \| B(j)_{(n-1) \ldots q} \| B(i)_{(p-1) \ldots 0} \| B(j)_{(q-1) \ldots 0}$

Now, choose P = Q = 2, and apply blocking recursively:

$S_{mz}^{(N,N)}(i,j) = i_{n-1} j_{n-1} i_{n-2} j_{n-2} \ldots i_1 j_1 i_0 j_0$



This mapping is called the Morton Z-order, and is illustrated in Figure 2. 
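The bit-interleaving definition can be written down directly. The following is a deliberately naive sketch (one loop iteration per bit); the function name `morton` and its parameter `n` (the number of bits per index) are illustrative assumptions.

```python
def morton(i, j, n):
    """Morton Z-order offset: interleave the n-bit indices i and j as
    i_{n-1} j_{n-1} ... i_1 j_1 i_0 j_0 (i supplies the odd bit positions)."""
    z = 0
    for b in range(n):
        z |= ((j >> b) & 1) << (2 * b)       # bit b of j -> position 2b
        z |= ((i >> b) & 1) << (2 * b + 1)   # bit b of i -> position 2b + 1
    return z


# Element A[5, 4] of an 8 x 8 array (n = 3), as in the Fig. 2 caption.
example = morton(5, 4, 3)
```

Here `example` evaluates to 50, i.e. binary 110010, matching the worked example given for the 8 × 8 array.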

Morton Layout can be an Unbiased Compromise Between Row-Major
and Column-Major. The key property which motivates our study of Morton
layout is the following: Given a cache with any even power-of-two block size,
with an array mapped according to the Morton order mapping $S_{mz}$, the cache
hit rate of a row-major traversal is the same as the cache-hit rate of a column-
major traversal. This applies given any cache hierarchy with even power-of-two
block size at each level. This is illustrated in Figure 2. The cache hit rate for
a cache with block size $2^{2k}$ is $1 - (1/2^k)$.

Examples. For cache blocks of 32 bytes (4 double words, k = 1) this gives 
a hit rate of 50%. For cache blocks of 128 bytes (k = 2) the hit rate is 75% as 
illustrated earlier. For 8kB pages, the hit rate is 96.875%. In Table 1, we contrast 
these hit rates with the corresponding theoretical hit rates that would result from 
row-major and column-major layout. Notice that traversing the same array in 
column-major order would result in a swap of the row-major and column-major 
columns, but leave the hit rates for Morton layout unchanged. In Section 2, we 
show that this desirable property of Morton layout is conditional on choosing a 
suitable alignment for the base address of the array. 
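The formula $1 - 1/2^k$ can be checked numerically. The sketch below assumes a block-aligned base address (offset 0) and, as a simplification, counts a hit only when an access falls in the same block as the previous access of the same row or column sweep. All names are illustrative.

```python
def morton(i, j, n):
    """Interleave the n-bit indices i and j (i in the odd bit positions)."""
    z = 0
    for b in range(n):
        z |= ((j >> b) & 1) << (2 * b)
        z |= ((i >> b) & 1) << (2 * b + 1)
    return z


def morton_hit_rate(n, k, order):
    """Hit rate for blocks of 2^(2k) words on a 2^n x 2^n Morton array."""
    N = 1 << n
    block_words = 1 << (2 * k)
    hits = 0
    for a in range(N):
        last_block = None  # large-array assumption: no reuse across sweeps
        for b in range(N):
            i, j = (a, b) if order == "row" else (b, a)
            block = morton(i, j, n) // block_words
            if block == last_block:
                hits += 1
            last_block = block
    return hits / (N * N)


# k = 1: 32-byte line, k = 2: 128-byte line, k = 5: 8kB page (double words)
rates = {k: (morton_hit_rate(6, k, "row"), morton_hit_rate(6, k, "col"))
         for k in (1, 2, 5)}
```

For a 64 × 64 array this reproduces the Morton column of Table 1 for both traversal orders.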



Morton-Order Address Calculation Using Dilated Arithmetic or Table
Lookup. Bit-interleaving is too complex to execute at every loop iteration. Wise
et al. [12] explore an intriguing alternative: represent each loop control variable i
as a “dilated” integer, where the bits of i are interleaved with zeroes. Define $\mathcal{D}_0$
and $\mathcal{D}_1$ such that

$B(\mathcal{D}_0(i)) = 0 i_{n-1} 0 i_{n-2} \ldots 0 i_1 0 i_0$ and $B(\mathcal{D}_1(i)) = i_{n-1} 0 i_{n-2} 0 \ldots i_1 0 i_0 0$

Now we can express the Morton address mapping as $S_{mz}^{(N,N)}(i,j) = \mathcal{D}_1(i) | \mathcal{D}_0(j)$,
where “|” denotes bitwise-or. At each loop iteration we increment the loop control
variable; this is fairly straightforward. Let “&” denote bitwise-and. Then:

$\mathcal{D}_0(i+1) = ((\mathcal{D}_0(i) | \mathrm{Ones}_0) + 1) \,\&\, \mathrm{Ones}_1$

$\mathcal{D}_1(i+1) = ((\mathcal{D}_1(i) | \mathrm{Ones}_1) + 1) \,\&\, \mathrm{Ones}_0$

where $B(\mathrm{Ones}_0) = 10101 \ldots 01010$ and $B(\mathrm{Ones}_1) = 01010 \ldots 10101$.

This approach works when the array is accessed using an induction variable
which can be incremented using dilated addition. We found that a simpler scheme
often works nearly as well: we simply pre-compute a table for the two mappings
$\mathcal{D}_0(i)$ and $\mathcal{D}_1(i)$. Table accesses are likely cache hits, as their range is small and
they have unit stride.
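Both schemes can be sketched in a few lines. The code below is an illustration only: the helper names (`dilate0`, `inc_d0`, `inc_d1`, `make_tables`) are hypothetical, and Python integers stand in for fixed-width machine words, so the masks are built explicitly for n-bit indices.

```python
def dilate0(i, n):
    """D_0(i): bits of the n-bit index i placed at even positions."""
    d = 0
    for b in range(n):
        d |= ((i >> b) & 1) << (2 * b)
    return d


def masks(n):
    """Return (Ones_0, Ones_1): ones at odd / even positions of 2n bits."""
    ones1 = sum(1 << (2 * b) for b in range(n))  # ...0101
    ones0 = ones1 << 1                           # ...1010
    return ones0, ones1


def inc_d0(d, ones0, ones1):
    """D_0(i + 1) from D_0(i): fill the gaps with ones so the carry ripples."""
    return ((d | ones0) + 1) & ones1


def inc_d1(d, ones0, ones1):
    """D_1(i + 1) from D_1(i)."""
    return ((d | ones1) + 1) & ones0


def make_tables(n):
    """Pre-computed lookup tables for D_0 and D_1 over all n-bit indices."""
    d0 = [dilate0(i, n) for i in range(1 << n)]
    d1 = [d << 1 for d in d0]
    return d0, d1


D0, D1 = make_tables(3)
address = D1[5] | D0[4]  # Morton offset of A[5, 4] in an 8 x 8 array
```

With the tables, the address of A[i, j] costs two unit-stride table reads and one bitwise-or; with dilated increments, the tables are not needed but the loop variables must be kept in dilated form.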



2 Alignment of the Base Address of Morton Arrays 

Fig. 3. Alignment of Morton-order Arrays. This figure shows the impact of
misaligning the base address of a 4 × 4 Morton array from the alignment of a 4-word cache
line. The numbers next to each row and below each column indicate the number of
misses encountered when traversing a row (column) of the array in row-major
(column-major) order, considering only spatial locality. Underneath each diagram, we show the
average theoretical hit rate for the entire Morton array for both row-major (RM) and
column-major (CM) traversal

With lexicographic layout, it is often important to pad the row or column length
of an array to avoid associativity conflicts [7]. With Morton layout, it turns
out to be important to pad the base address of the array. In our discussion of
the cache hit rate resulting from Morton order arrays in the previous section, 
we have implicitly assumed that the base address of the array will be mapped 
to the start of a cache line. For a 32 byte, i.e. 2x2 double word cache line, 
this would mean that the base address of the Morton array is 32-byte aligned. 
As we have illustrated previously in Section 1.1, such an allocation is unbiased 
towards any particular order of traversal. However, in Figure 3 we show that if 
the allocated array is offset from this “perfect” alignment, Morton layout may 
no longer be an unbiased compromise storage layout: The average miss-rate of 
traversing the array, both in row- and in column-major order, is always worse 
when the alignment of the base address is offset from the alignment of a 4-word 
cache line. Further, when the array is mis-aligned, we lose the symmetry property 
of Morton order being an unbiased compromise between row- and column-major 
storage layout. 

Fig. 4. Miss-rates for row-major and column-major traversal of Morton
arrays. We show the best, worst and average miss-rates for different units of memory
hierarchy (referred to as blocksizes), across all possible alignments of the base address of
the Morton array. The top two graphs use a linear y-axis, whilst the graph underneath
uses a logarithmic y-axis to illustrate that the pattern of miss-rates is in fact highly
structured across all levels of the memory hierarchy

Systematic Study Across Different Levels of Memory Hierarchy. In
order to investigate this effect further, we systematically calculated the resulting
miss-rates for both row- and column-major traversal of Morton arrays, over
a range of possible levels of memory hierarchy, and for each level, different
misalignments of the base address of Morton arrays. The range of block sizes in
memory hierarchy we covered was from $2^2$ double words, corresponding to a
32-byte cache line, to $2^{10}$ double words, corresponding to an 8kB page. Architectural
considerations imply that block sizes in the memory hierarchy such as cache
lines or pages have a power-of-two size. For each $2^n$ block size, we calculated,
over all possible alignments of the base address of a Morton array with respect
to this block size, the best, worst and average resulting miss-rates
for both row-major and column-major traversal of the array. The standard C
library malloc() function returns addresses which are double-word aligned. We
therefore conducted our study at the resolution of double words. The results of
our calculation are summarised in Figure 4. Based on those results, we offer the
following conclusions.

1. The average miss-rate is the performance that might be expected when no
special steps are taken to align the base address of a Morton array. We note 
that the miss rates resulting from such alignments are always suboptimal. 

2. The best average hit rates for both row- and column-major traversal are 
always achieved by aligning the base address of Morton arrays to the largest 
significant block size of memory hierarchy (e.g. page size). 

3. The difference between the best and the worst miss-rates can be very signif- 
icant, up to a factor of 2 for both row-major and column-major traversal. 

4. We observe that the symmetry property which we mentioned in Section 1.1 
is in fact only available when using the best alignment and for even power- 
of-two block sizes in the memory hierarchy. For odd power-of-two block sizes 
(such as $2^3 = 8$ double words, corresponding to a 64-byte cache line), we
find that the Z-Morton layout is still significantly biased towards row-major 
traversal. An alternative recursive layout such as Hilbert layout [6, 3] may 
have better properties in this respect. 

5. The absolute miss-rates we observe drop exponentially through increasing 
levels of the memory hierarchy (see the graphs in Figure 4). However, if we 
assume that not only the block size but also the access time of different
levels of memory hierarchy increases exponentially [1], the penalty of
misalignment of Morton arrays does not diminish significantly for larger block
sizes. From a theoretical point of view, we therefore recommend aligning the 
base address of all Morton arrays to the largest significant block size in the 
memory hierarchy, i.e. page size. 

In real machines, there are conflicting performance issues apart from maximising 
spatial locality, such as aliasing of addresses that are identical modulo some 
power-of-two, and some of these could negate the benefits of increased spatial 
locality resulting from making the base address of Morton arrays page-aligned. 
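The kind of calculation behind the alignment study can be sketched as follows. This is a simplified model under stated assumptions, not the authors' actual program: only spatial locality within one row (or column) sweep is counted, the base offset is expressed in double words, and the names `miss_rate` and `sweep` are hypothetical.

```python
def morton(i, j, n):
    """Interleave the n-bit indices i and j (i in the odd bit positions)."""
    z = 0
    for b in range(n):
        z |= ((j >> b) & 1) << (2 * b)
        z |= ((i >> b) & 1) << (2 * b + 1)
    return z


def miss_rate(base, block_words, n, order):
    """Miss rate when the array starts `base` words past a block boundary."""
    N = 1 << n
    misses = 0
    for a in range(N):
        last = None  # only spatial locality within one sweep is counted
        for b in range(N):
            i, j = (a, b) if order == "row" else (b, a)
            block = (base + morton(i, j, n)) // block_words
            if block != last:
                misses += 1
            last = block
    return misses / (N * N)


def sweep(block_words, n, order):
    """Best, worst and average miss rate over all base alignments."""
    rates = [miss_rate(a, block_words, n, order) for a in range(block_words)]
    return min(rates), max(rates), sum(rates) / len(rates)


rm_best, rm_worst, rm_avg = sweep(4, 4, "row")  # 4-word line, 16 x 16 array
cm_best, cm_worst, cm_avg = sweep(4, 4, "col")
```

In this model, a 4-word line gives a best miss rate of 0.5 for both traversal orders (reached with an aligned base), while the worst column-major alignment misses on every access. Running `miss_rate(0, 8, 4, "row")` and `miss_rate(0, 8, 4, "col")` illustrates the row-major bias for an odd power-of-two block size even with an aligned base.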



Experimental Evaluation of Varying the Alignment of the Base Ad- 
dress of Morton Arrays. In our experimental evaluation, we have studied 
the impact on actual performance of the alignment of the base address of Mor- 
ton arrays. For each architecture and each benchmark, we have measured the 
performance of Morton layout both when using the system’s default alignment 
(i.e. addresses as returned by malloc()) and when aligning arrays to each
significant size of memory hierarchy. Our experimental methodology is described
in Section 3.1. Detailed performance figures showing the impact of varying the 
alignment of the base address of Morton arrays over all significant levels of 
memory hierarchy are contained in an accompanying technical report [9]. Our 
theoretical assertion that aligning with the largest significant block size in the 
memory hierarchy, i.e. page size, should always be best is supported in most, but 
not all cases, and we assume that where this is not the case, this is due to inter- 
ference effects. Figures 5-8 of this paper include performance results for Morton 
storage layout with default- and page-alignment of the array’s base address. 

3 Unrolling Loops over Morton Arrays

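Linear array layouts have the following property. Let $\mathcal{L}\left(\binom{i}{j}\right)$ be the address
calculation function which returns the offset from the array base address at
which the element identified by index vector $\binom{i}{j}$ is stored. Then, for any
offset-vector $\binom{k_1}{k_2}$, we have

$\mathcal{L}\left(\binom{i}{j} + \binom{k_1}{k_2}\right) = \mathcal{L}\left(\binom{i}{j}\right) + \mathcal{L}\left(\binom{k_1}{k_2}\right)$ . (1)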

As an example, for a row-major array A, A(i, j+k) is stored at location A(i, j)+k. 
Compilers can exploit this transformation when unrolling loops over arrays with 
linear array layouts by strength-reducing the address calculation for all except 
the first loop iteration in the unrolled loop body to simple addition of a constant. 

As stated in Section 1.1, the Morton address mapping is $S_{mz}(i,j) = \mathcal{D}_1(i) | \mathcal{D}_0(j)$,
where “|” denotes bitwise-or, which can be implemented as addition.
Given offset k,

$S_{mz}(i, j+k) = \mathcal{D}_1(i) | \mathcal{D}_0(j+k) = \mathcal{D}_1(i) + \mathcal{D}_0(j+k)$ .

The problem is that there is no general way of simplifying $\mathcal{D}_0(j+k)$ for all j
and all k.

Proposition 1 (Strength-reduction of Morton address calculation). Let
u be some power-of-two number such that $u = 2^n$. Assume that j mod u = 0 and
that k < u. Then,

$\mathcal{D}_0(j+k) = \mathcal{D}_0(j) + \mathcal{D}_0(k)$ . (2)

This follows from the following observations: If j mod u = 0 then the n least
significant bits of j are zero; if k < u then all except the n least significant
bits of k are zero. Therefore, the dilated addition $\mathcal{D}_0(j+k)$ can be performed
separately on the n least significant bits of j.

As an example, assume that j mod 4 = 0. Then, the following strength- 
reductions of Morton order address calculation are valid: 

$S_{mz}(i, j+1) = \mathcal{D}_1(i) + \mathcal{D}_0(j) + 1$
$S_{mz}(i, j+2) = \mathcal{D}_1(i) + \mathcal{D}_0(j) + 4$
$S_{mz}(i, j+3) = \mathcal{D}_1(i) + \mathcal{D}_0(j) + 5$

An analogous result holds for the i index. Therefore, by carefully choosing the 
alignment of the starting loop iteration variable with respect to the array in- 
dices used in the loop body and by choosing a power-of-two unrolling factor, 
loops over Morton order arrays can benefit from strength-reduction in unrolled 
loops. In our implementation, this means that memory references for the Morton 
tables are replaced by simple addition of constants. Existing production com- 
pilers cannot find this transformation automatically. We therefore implemented 
this unrolling scheme by hand in order to quantify the possible benefit. We report 
very promising initial performance results in Section 3.1. 
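The effect of Proposition 1 on a loop unrolled by four can be sketched as follows. This is an illustration of the address arithmetic only, not the hand-unrolled C kernels used in the experiments; the function names are hypothetical, and the sketch assumes that N is a multiple of four and that j starts at zero.

```python
def dilate0(i):
    """D_0(i): spread the bits of i to the even bit positions."""
    d, b = 0, 0
    while (i >> b) != 0:
        d |= ((i >> b) & 1) << (2 * b)
        b += 1
    return d


def row_offsets_naive(i, N):
    """Morton offsets of row i, one full address calculation per element."""
    d1_i = dilate0(i) << 1
    return [d1_i | dilate0(j) for j in range(N)]


def row_offsets_unrolled(i, N):
    """Same offsets with a 4-way unrolled loop and constant increments."""
    d1_i = dilate0(i) << 1
    out = []
    for j in range(0, N, 4):          # j mod 4 == 0 by construction
        base = d1_i + dilate0(j)      # one dilation per four elements
        # D_0(1) = 1, D_0(2) = 4, D_0(3) = 5 are compile-time constants
        out += [base, base + 1, base + 4, base + 5]
    return out
```

Both functions return the same list of offsets for every row, but the unrolled version needs only one dilation (or table lookup) per group of four elements.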

Table 2. Numerical kernels used in our evaluation, together with their
baseline performance on the different platforms used. For each kernel, for each
machine, we show the performance range in MFLOPs for row-major array layout over
all problem sizes covered in our experiments

Kernel     Description
Adi        Alternating-direction implicit kernel, ij-ij order
Cholk      Cholesky k-variant
Jacobi2D   Two-dimensional four-point stencil smoother
MMijk      Matrix multiply, ijk loop nest order
MMikj      Matrix multiply, ikj loop nest order

          Adi             Cholk           Jacobi2D          MMijk           MMikj
          min    max      min    max      min     max       min    max      min     max
Alpha     27.0   84.5     4.8    41.1     24.2    167.1     6.0    139.5    37.6    177.0
Athlon    43.8   210.4    [?]    308.5    150.6   1078.6    9.5    262.5    117.4   884.2
P3        13.7   46.6     3.9    42.1     38.7    122.3     15.5   91.8     46.4    153.8
P4        46.2   134.1    4.8    266.1    159.6   1337.3    12.6   147.3    281.4   939.1
Sparc     11.4   54.3     3.5    78.4     33.2    138.6     5.0    131.9    20.5    142.8

[?] This value is illegible in the source.



3.1 Experimental Evaluation 

Benchmark Kernels and Architectures. To test our hypothesis that Mor- 
ton layout is a useful compromise between row-major and column-major layout 
experimentally, we have collected a suite of simple implementations of standard 
numerical kernels operating on two-dimensional arrays and carried out experi- 
ments on five different architectures. The kernels used are shown in Table 2 and 
the platforms in Table 3. 



Performance Results. Figures 5-8 show our results in detail, and we make 
some comments directly in the figures. We have carried out extensive measure- 
ments over a full range of problem sizes: the data underlying the graphs in 
Figures 5-8 consist of more than 25 million individual measurements. For each 
experiment / architecture pair, we give a broad characterisation of whether Mor- 
ton layout is a useful compromise between row-major and column-major in this 
setting by annotating the figures with win , lose , etc. 



Impact of Unrolling. By inspecting the assembly code, we established that 
at least the icc compiler on x86 architectures does automatically unroll our 
benchmark kernels for row-major layout. In Figures 5-8, we show that manually 
unrolling the loops over Morton arrays by a factor of four, using the technique 
described in Section 3, can result in a significant performance improvement of 
the Morton code: on several architectures, the unrolled Morton codes perform,
over part of the range of problem sizes, very close to or even better than
the best canonical code. We plan to explore this promising result
further by investigating larger unrolling factors. 

Table 3. Cache and CPU configurations used in the experiments. Compilers
and compiler flags match those used by the vendors in their SPEC CFP2000 (base)
benchmark reports

System: Alpha (Compaq AlphaServer ES40)
  Processor: Alpha 21264 (EV6) 500MHz
  Operating System: OSF1 V5.0
  L1/L2/Memory Parameters:
    L1 D-cache: 2-way, 64KB, 64B cache line
    L2 cache: direct mapped, 4MB
    Page size: 8KB
    Main Memory: 4GB RAM
  Compiler and Flags Used: Compaq C Compiler V6.1-020; -arch ev6 -fast -O4

System: Sun (SunFire 6800)
  Processor: UltraSparcIII (v9) 750MHz
  Operating System: SunOS 5.8
  L1/L2/Memory Parameters:
    L1 D-cache: 4-way, 64KB, 32B cache line
    L2 cache: direct-mapped, 8MB
    Page size: 8KB
    Main Memory: 24GB
  Compiler and Flags Used: Sun Workshop 6; -fast -xcrossfile -xalias_level=std

System: PIII
  Processor: PentiumIII Coppermine 450MHz
  Operating System: Linux 2.4.20
  L1/L2/Memory Parameters:
    L1 D-cache: 4-way, 16KB, 32B cache line
    L2 cache: 4-way, 512KB, sectored 32B cache line
    Page size: 4KB
    Main Memory: 256MB SDRAM
  Compiler and Flags Used: Intel C/C++ Compiler v7.1; -xK -ipo -O3 -static

System: P4
  Processor: Pentium 4 2.0 GHz
  Operating System: Linux 2.4.20
  L1/L2/Memory Parameters:
    L1 D-cache: 4-way, 8KB, sectored 64B cache line
    L2 cache: 8-way, 512KB, sectored 128B cache line
    Page size: 4KB
    Main Memory: 512MB DDR-RAM
  Compiler and Flags Used: Intel C/C++ Compiler v7.1; -xW -ipo -O3 -static

System: AMD
  Processor: AMD Athlon XP 2100+ 1.8GHz
  Operating System: Linux 2.4.20
  L1/L2/Memory Parameters:
    L1 D-cache: 2-way, 64KB, 64B cache line
    L2 cache: 16-way, 256KB, 64B cache line
    Page size: 4KB
    Main Memory: 512MB DDR-RAM
  Compiler and Flags Used: Intel C/C++ Compiler v7.1; -xK -ipo -static



4 Related Work and Conclusions 

Related Work. Chatterjee et al. [2] study Morton layout and a blocked “4D” 
layout. They focus on tiled implementations, for which they find that the 4D 
layout achieves higher performance than the Morton layout because the ad- 
dress calculation problem is easier, while much or all the spatial locality is still 
exploited. Their work has similar goals to ours, but all their benchmark applica- 
tions are tiled for temporal locality; they show impressive performance, with the 
further advantage that performance is less sensitive to small changes in tile size 
and problem size, which can result in cache associativity conflicts with conven- 
tional layouts. In contrast, the goal of our work is to evaluate whether Morton 
layout can simplify the performance programming model presented by compilers 
for languages with multi-dimensional arrays. 

Wise et al. [11] argue for compiler-support for Morton order matrices. They 
use a recursive implementation of loops over Morton arrays, with recursion un- 
folding and re-rolling into small loops. However, they find it hard to overcome 
the cost of addressing without recursion. 

Gustavson et al. [5] show that complementing a tiled implementation of
BLAS-3 routines with a recursively blocked storage layout can lead to additional 
performance improvements. 



Conclusions. We believe that work on nonlinear storage layouts, such as
Morton order, is applicable in a number of different areas.

- Simplifying the performance-programming model offered to application
  programmers is one important objective of language design and compiler
  research. We believe that the work presented in this paper can reduce the
  price of the attractive properties offered by Morton layout over canonical
  layouts.

- Storage layout transformations are always valid and can be applied even in
  codes where tiling is not valid or hard to apply. Storage layout transformation
  can thus be additional and complementary to iteration space transformations.



Future Work. We have reason to believe that unrolling loops over Morton 
arrays by factors larger than four is likely to yield greater benefits than we 
have measured thus far. We are also planning to investigate the performance of 
Morton layout in tiled codes and software-directed pre-fetching for loops over 
Morton arrays. We believe that the techniques we have presented in this paper 
facilitate an implementation of Morton layout for two-dimensional arrays that 
is beginning to fulfil its theoretical promise. 

- Notice that for Alpha, the upper limit is 1024 × 1024.

- For Alpha (Sun), the fall-off in RM performance occurs at 725 × 725
  (1024 × 1024), when the total data size exceeds the L2 cache size of 4MB (8MB),
  direct mapped. This assumes a working set of 725 × 725 (1024 × 1024) doubles.

- Alignment significantly improves performance of the default Morton scheme
  on P3. On other platforms, alignment also yields slight improvements.

Fig. 5. ADI performance in MFLOPs on different platforms. We compare
row-major, column-major, Morton with default alignment of the base address of the
array, Morton with page-aligned base address and unrolled-Morton with page-aligned
base address and factor 4 loop unrolling

Panel titles: Jacobi2D on Sparc: Performance in MFLOP/s; Jacobi2D on P3: Performance in MFLOP/s

- On Alpha, Sparc and P3, the page-aligned Morton version improves over the
  basic Morton scheme.

- Unrolling improves performance of the best aligned Morton implementation,
  in particular on x86 where the unrolled Morton performance is within reach of
  the best canonical.

Fig. 6. Jacobi2D performance in MFLOPs on different platforms. We compare
row-major, column-major, Morton with default alignment of the base address of
the array, Morton with page-aligned base address and Morton with page-aligned base
address and factor 4 loop unrolling






Plot annotations: Win over CM for problem sizes larger than about 330 × 330.
Win over CM for problem sizes larger than about 500 × 500.

Panel title: MMikj on Alpha: Performance in MFLOP/s

- For Alpha and P3, notice that the upper limit is 1024 × 1024.

- On all platforms except Sparc, unrolling yields a significant improvement
  over the basic Morton scheme.

Fig. 7. MMikj performance in MFLOPs on different platforms. We compare
row-major, column-major, Morton with default alignment of the base address of
the array, Morton with page-aligned base address and Morton with page-aligned base
address and factor 4 loop unrolling

- For Alpha, notice that the upper limit is 1024 × 1024.

- Notice the sharp drop in RM and CM performance on Alpha (around 360 × 360)
  and on Sparc (around 700 × 700) platforms.

- On all platforms except Sparc, unrolling yields a significant improvement
  over the basic Morton scheme.

Fig. 8. MMijk performance in MFLOPs on different platforms. We compare
row-major, column-major, Morton with default alignment of the base address of
the array, Morton with page-aligned base address and Morton with page-aligned base
address and factor 4 loop unrolling

Abstract. Networks of embedded systems, in the form of cell phones,
PDAs, wearable computers, and sensors connected through wireless net- 
working technology, are emerging as an important computing platform. 
The ubiquitous nature of such a platform promises exciting applications. 
This paper presents a new programming model for a network of embed- 
ded systems, called Spatial Views, targeting its dynamic, space-sensitive 
and resource-restrained characteristics. The core of the proposed model 
is iterative programming over a dynamic collection of nodes identified 
by the physical spaces they are in and the services they provide. Hid- 
den in the iteration is execution migration as the main collaboration 
paradigm, constrained by user specified limits on resource usage such as 
response time and energy consumption. A Spatial Views prototype has 
been implemented, and first results are reported. 



1 Introduction 

The possibility of building massive networks of embedded systems (NES) has 
become a reality. For instance, cell phones, PDA’s, and other gadgets carried by 
passengers on a train can form an ad hoc network through wireless connection. 
In addition to those volatile and dynamic nodes, the network may contain fixed 
nodes installed on the train, for instance public displays, keyboards, sensors, 
or Internet connections. Similar networks can be established across buildings, 
airports or even on highways among car-mounted computers. Any device with 
a processor, some memory and a network connection, probably integrated on 
a single chip, can join such a network. The applications of such a network would be
limited only by our imagination, if we had the right programming models and
abstractions.

Existing programming models do not address key issues for applications that 
will run on a network of embedded systems. 

Physical Locations: An application has a physical target space region, i.e.,
a space of interest in which it executes. The semantics of a program executing
outside its target space is not defined. For instance, it makes a difference
if an application collects temperature readings within a building or outside
a building, and whether all or only a subset of temperature sensors are
to be polled. A motion sensor reading may trigger the activation of other 
sensors but only of those which are in the spatial proximity of the motion 
sensor. A programmer must be able to specify physical spaces and location 
constraints over these spaces. 

Volatile and Dynamic Networks: Nodes may join and leave at any time, 
because of movement or failure. Portable devices or sensors, carried by
a person or an animal [1], may leave the space of interest while they are
moving with their carriers. Battery-powered small devices may run out of power
at any point. A node available at time t cannot be assumed available at time
t + Δt or time t − Δt, where Δt can be very small relative to the application
execution time.

Resource Constraints: Resources like energy and execution time are limited 
in a network of embedded systems, due to the hardware form factor and 
application characteristics. Graceful degradation of quality of results is nec- 
essary in such an environment. Instead of draining the battery of the sen- 
sors, you might want to limit the total energy used by a program and accept 
a slightly worse answer. Or you may limit the response time of a query for 
traffic information 10 miles ahead on the highway, so you will have enough 
time to choose a detour after getting the answer. In those cases, energy-wasting
or late answers are no better, or even worse, than no answer. Programmers
should be able to specify the amount of resources used during
a program execution, so trade-offs between quality of results and resource 
usage can be made. 

In this paper, we introduce Spatial Views, a novel programming model tar- 
geting networks of embedded systems. Spaces, services and resource constraints 
are explicit programming elements in Spatial Views. Spaces and services are 
combined to define dynamic collections of interesting nodes in a network, called 
Spatial Views. Iterators and Selectors specify code to execute in a view under 
a specified time constraint, and possibly additional user specified resource con- 
straints. These high level program constructs are based on a migratory execution 
model, guided by the space and service of interest. However, Spatial Views does 
not exclude an implementation using other communication mechanisms, such as 
remote procedure calls, message passing or even socket programming, for per- 
formance or energy efficiencies. 

Network or node failures are transparent to the programming model. How- 
ever, there is no guarantee that the execution of an application will be able 
to complete successfully. Our proposed model is not fault tolerant, but allows 
answers of different qualities. In contrast, in a traditional programming model 
for a stable target system, any answer is considered to have perfect quality. In 
our programming model, it is the responsibility of the programmer to assess 
the quality of an answer. For example, if a user wants the average temperature 
calculated from readings of at least ten network nodes, he or she should report 
the average temperature together with the number of actually visited nodes to 



260 Yang Ni et al. 



assess the quality of the answer. A best-effort compiler and run-time system will 
try to visit as many nodes as possible, as long as no user defined constraint is 
violated, assuming that visiting more nodes will produce a potentially better an- 
swer. A target space and a time constraint have to be specified for each program 
to confine its execution, including node discovery, to a space × time interval.

Security and privacy issues are also important in a NES but are not currently 
part of our programming model. The same application will run on a secure 
network as well as on an insecure network. We assume that security-sensitive
hosts will implement authentication and protection policies at a lower level than 
the programming model. 

Smart Messages [2] and Spatial Programming [3] are possible implementation 
platforms for our proposed Spatial Views programming model. A programming 
environment for execution migration that includes protection and encryption for 
Smart Messages is currently under investigation [4], which could be used as a
secure infrastructure to implement our programming model. However, in this
paper we describe an implementation of Spatial Views on top of Sun’s K Virtual 
Machine (KVM) independent of Smart Messages. 

In the rest of this paper, we will present a survey of related works (section 
2), the programming model (section 3), a discussion of the implementation of 
a prototype system (section 4) and experimental results (section 5).

2 Related Work 

Our work is related to recent work on sensor networks [5, 6, 7, 8, 9] in that they
all target ad hoc networks of wireless devices with limited resources. However,
we broaden the spectrum of network nodes to include computationally more powerful
devices like PDAs, cell phones and even workstations or servers.

TinyOS[5] and nesC[7] provide a component-based event-driven program- 
ming environment for Motes. Motes are small wireless computing devices that 
have processors of a couple of MHz, about 4KB RAM and 10Kbps wireless com- 
munication capability. TinyOS and nesC use Active Messages as the communi- 
cation paradigm. Active Messages has a similar flavor to execution migration 
of Spatial Views, but uses non-migrating handlers instead of migrating code.
Mate [6] is a tiny virtual machine built on top of TinyOS for sensor networks. It 
allows capsules, i.e. Mate programs, in bytecode to forward themselves through 
a network with a single instruction, which bears a resemblance to execution
migration in Spatial Views. Self forwarding enables on-line software upgrading, 
which is important in large-scale sensor networks. 

Next, we discuss related work on services and locations. We also discuss 
related work on execution migration, which is used in the implementation of 
the prototype for our programming model. 

2.1 Service Discovery 

Service discovery is a research area with a long history. A service is usually 
specified either as an interface (as in Jini [10]) or as a tuple of 
attribute-and-value pairs (as in INS [11]). Attribute-and-value pairs describe 
a hierarchical service space by adding new attributes and corresponding values 
to a describing tuple. The same goal can be achieved through interface 
sub-typing. 

The Spatial Views programming model specifies services as interfaces. 
Applications and services agree on the semantics of the methods of the 
services. We assume that the operating system provides service discovery as a 
basic function. However, we did implement a simple service discovery in the 
Spatial Views runtime libraries using the random walk technique. 

2.2 Location Technology 

GPS [12] is the most developed positioning technology. It is available 
worldwide in all weather and is highly accurate for its scale: 16 meters for 
absolute positions and 1 meter for relative positions. In spite of its many 
advantages, GPS is only available outdoors, and its accuracy is still not 
satisfactory for many mobile computing applications. In recent years, more 
accurate indoor positioning technologies have been developed by the mobile 
computing community. Active Badges and Bats [13, 14, 15] are tracking systems 
accurate to within a few centimeters. Each object is tagged with an RFID tag 
and tracked by a centralized system. Although accurate, Active Badges and Bats 
are costly and hard to deploy. User privacy is not protected, since everyone 
with a tag exposes his or her position by sending out radio signals. The 
central machine in charge of analyzing each user’s position causes scalability 
problems and represents a single point of failure. Cricket [16, 17] tries to 
address those issues by using a distributed and passive architecture similar to 
GPS. Cricket is based on special beacons and receivers and uses the time of 
flight of radio and ultrasound signals to locate a device. It provides a 
precision of a few meters. RADAR [18] is also a passive system like GPS, but it 
is based on the popular 802.11 technology and uses radio signal strength to 
locate a device. The precision of RADAR is in the range of 2 to 3 meters. 

2.3 Migratory Execution 

Spatial Views is part of the Smart Messages project [19, 2]. The goal of Spatial 
Views is to build a high-level, space-aware programming language over Smart 
Messages. We built a simple implementation of the migratory execution feature 
of Smart Messages for rapid prototyping and evaluation of Spatial Views. 

Migratory execution has been extensively studied in the literature, especially 
in the context of mobile agents [20, 21]. However, Spatial Views only supports 
implicit, transparent migration hidden in its iteration operation, and names a 
node based on the services that it provides. Spatial Views/Smart Messages 
differs from mobile agents in its design goal. We are designing a programming 
tool and infrastructure for cooperative computing on networks of embedded 
systems. The major network connection is assumed to be wireless. Spatial 
Views/Smart Messages uses content naming, and a migrating program is 
responsible for its own routing. 

3 Programming Model 

To program a network of embedded systems in Spatial Views, a programmer 
specifies the nodes in which he or she is interested based on the properties of 
the nodes. Then he or she specifies the task to be executed on those nodes. The 
properties used to identify interesting nodes include the services of the nodes 
and their locations. 

A program starts running on one node. Whenever it needs some services 
which the current node does not provide, it discovers another node that does, 
and migrates there to continue its execution. 

Spatial Views provides necessary programming abstractions and constructs 
for this novel programming model. Node discovery, ad hoc network routing, and 
execution migration are transparently implemented by the compiler, runtime 
system, and the operating system. A programmer is freed from dealing directly 
with the dynamic network. Figure 1 shows an example of a Spatial Views program. 
We will walk through this example in Section 3.3. 

3.1 Services and Virtual Nodes 

NES computing is cooperative computing [19]. Nodes participate in a common 
computing task by providing some service and using services provided by other 
nodes at the same time. A service is described or named with an interface in 
Spatial Views. Nodes provide services which are discovered at run-time, and are 
provided as objects implementing certain interfaces. In our programming model, 
discovery is assumed to be a basic function provided by the underlying 
middleware or OS. However, we provide a simple discovery implementation based 
on the “random walk” technique in Section 4. The discovery procedure looks for 
nodes hosting classes that implement the interface. When such a node is found, 
an object of the class is created. The program is then able to use the service 
through the object. The discovery may be confined to a certain physical space, 
as we will discuss in Section 3.2. 

The basic programming abstraction in Spatial Views is a virtual node, which 
is denoted as a pair (service, location), representing a physical node that 
provides the service and is located at the location. Concrete physical nodes 
with IP addresses or MAC addresses are replaced by virtual nodes. Depending on 
how many services it provides, a single physical node may be represented by 
multiple virtual nodes. More interestingly, if a physical node is mobile, it 
may be used as different virtual nodes at different points during the 
application execution. Uniquely identifying a particular physical node is not 
supported in Spatial Views. If an application needs to do so, the programmer 
can use some application-specific mechanism, for example, MAC addresses. 

3.2 Spatial Views, Iterators and Selectors 

A spatial view is a dynamic collection of virtual nodes that provide a common 
service and are located in a common space. Here a space is a set of locations, 
which can be a room, a floor, or a parking lot. Iterators and selectors 
describe actions to be performed over the nodes in a view. The instructions 
specified in the body of an iterator are executed on “all” or as many nodes as 
possible of the view. In contrast, the body of a selector is executed on only 
one node in the view if the view is not empty. 

The most important characteristic of a spatial view is its dynamic nature. 
It is a changing set of virtual nodes. A physical node may move out, or run out 
of power. So a virtual node may just disappear at an arbitrary point. On the 
other hand, new nodes may join at any time. For this reason, two consecutive 
invocations of the same iterator over the same view may lead to different results. 
A spatial view is defined as follows: 

SpatialViewDefinition → 

    SpatialView SV-id = new SpatialView(Service, Space_opt) 

where Service is the name of an interface and Space is the space of interest. 
If the space is omitted, any node providing the interesting service would be 
included in the view no matter where it is. 
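
The definition above can be pictured as a filter over the currently known virtual nodes. The following plain-Java sketch is only an illustration under stated assumptions, not the prototype's runtime: the class names `VirtualNode` and `SpatialViewSketch`, the string encoding of locations, and the prefix test used for "located in the space" are all hypothetical. A `null` space stands for the omitted Space argument.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a virtual node: a (service, location) pair.
final class VirtualNode {
    final String service;   // interface name, e.g. "Camera"
    final String location;  // symbolic location, e.g. "CoRE.3rdFloor.Room301" (assumed encoding)

    VirtualNode(String service, String location) {
        this.service = service;
        this.location = location;
    }
}

final class SpatialViewSketch {
    // Returns the virtual nodes that offer `service` and lie inside `space`.
    // Assumption: a space is a location prefix; null means "anywhere".
    static List<VirtualNode> view(List<VirtualNode> known, String service, String space) {
        List<VirtualNode> result = new ArrayList<>();
        for (VirtualNode n : known) {
            boolean serviceMatches = n.service.equals(service);
            boolean inSpace = (space == null) || n.location.startsWith(space);
            if (serviceMatches && inSpace) {
                result.add(n);
            }
        }
        return result;
    }

    // A fixed snapshot of a network. The first two entries model one physical
    // device that offers two services and therefore appears as two virtual nodes.
    static List<VirtualNode> sampleNetwork() {
        return Arrays.asList(
            new VirtualNode("Camera", "CoRE.3rdFloor.Room301"),
            new VirtualNode("FaceDetector", "CoRE.3rdFloor.Room301"),
            new VirtualNode("Camera", "CoRE.3rdFloor.Hall"),
            new VirtualNode("Camera", "CoRE.2ndFloor.Lobby"),
            new VirtualNode("FaceDetector", "Hill.1stFloor.Lab"));
    }

    public static void main(String[] args) {
        List<VirtualNode> net = sampleNetwork();
        System.out.println("cameras on 3rd floor: " + view(net, "Camera", "CoRE.3rdFloor").size());
        System.out.println("face detectors anywhere: " + view(net, "FaceDetector", null).size());
    }
}
```

Because a real view is dynamic, a snapshot like this is only valid at one instant; evaluating the same filter later may return a different set.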

A spatial view is accessed through an iterator or selector. 

Iterator → 

    foreach node-id in SV-id do TimeConstraint ConstraintList_opt Statement 

Selector → 

    forany node-id in SV-id do TimeConstraint ConstraintList_opt Statement 

TimeConstraint → 

    within NumberOfMilliseconds 

TimeConstraint gives a time constraint, which is mandatory. ConstraintList 
gives a list of constraints on energy, monetary or other resources to apply to an 
iterator or a selector. At this point, only time constraints are supported. 

A time constraint demands that an iterator or selector finish within 
NumberOfMilliseconds milliseconds. Time constraints are enforced following a 
best-effort semantics with the iteration body as the minimal atomic unit of 
constraint control. This means that an iteration will never be partially 
executed, even when a time constraint is violated. A time constraint in Spatial 
Views is a soft deadline, and is a time budget rather than a real-time 
deadline. In other words, the time constraint does not ensure that a program 
terminates successfully within the deadline, but ensures that no further 
iteration is started after the budget is exhausted. 



3.3 Example 

The example shown in Figure 1 illustrates a Spatial Views application that 
executes on a network that contains nodes with cameras and nodes that provide 
image processing services such as human face detection [22]. The program tries 
to find a person with a red shirt or sweater on the third floor of a building. 
An answer is expected back within 30 seconds (soft deadline). A time limit is 
necessary because the computed answer may become “stale” if returned too late 
(the missing person may have left the building by the time the successful 
search result is reported). 

 0: // Import space definitions. 
 1: import SpaceDefinition.Rutgers.*; 
 2: 
 3: public class SVExample { 
 4:   public static void main(String args[]) { 
 5:     // Define a Spatial View of cameras on the 3rd floor of the CoRE building 
 6:     SpatialView cameraView = new SpatialView("Camera", BuschCampus.CoRE.3rdFloor); 
 7:     Location loc; 
 8: 
 9:     // Iterate over camera view in 30 seconds 
10:     foreach camera in cameraView do within 30000 { 
11:       Picture pic = camera.getPicture(); 
12:       Rectangle redRegion = pic.findRegionInColor(Color.Red); 
13: 
14:       if (redRegion != null) { 
15:         // Define a Spatial View of face detection services. The default space is anywhere. 
16:         SpatialView detectorView = new SpatialView("FaceDetector"); 
17:         Rectangle face; 
18: 
19:         // select a detector and finish face detection in 10 seconds 
20:         forany detector in detectorView do within 10000 
21:           face = detector.detectFaceInPicture(pic); 
22: 
23:         // Check if the red region is close to the face so that we can think it is a person in red 
24:         if (face != null && face.isCloseTo(redRegion)) 
25:           loc = camera.getLocation(); 
26:       } 
27:     } 
28: 
29:     if (loc != null) 
30:       System.out.println("A person in red is found at " + loc); 
31:     else 
32:       System.out.println("No one in red is found."); 
33:   } 
34: } 

Fig. 1. Spatial Views example application of locating a person in red 

Static physical spaces such as buildings and floors within buildings may be 
defined as part of a Spatial Views space library. In the example, we assume 
that the package “SpaceDefinition.Rutgers.*” contains such definitions for the 
Rutgers University campuses. 

Line 6 defines a spatial view of cameras on the third floor of a building 
named CoRE (a building at Rutgers University). Lines 10-27 define the task 
to be performed on the cameras in the spatial view defined in line 6. It is an 
iterator, so the task will be executed on each camera discovered within the time 
constraint, 30 seconds as defined in line 10. When the execution reaches Line 
11, the program has migrated to a camera. Then a picture is taken. Line 
12 tries to find a region in the picture that is mostly red. If such a red region 
is found, another spatial view consisting of face detectors is defined (Line 16). 
Lines 20 and 21 use a face detector in the view defined in Line 16 to find a face 
in the picture. (Because it is a selector, the statement in lines 20 and 21 
finishes right after the first face detector is discovered.) If the face 
detected is close to the red region in the picture, the program concludes that 
it is a person in a red shirt, and remembers the location of the camera that 
took the picture. This location is reported at the end of the program (Lines 
29-32). 

Fig. 2. Compilation of Spatial Views Programs 

Fig. 3. Architecture of a Node 



4 Implementation 

The implementation itself is not the major contribution of this paper. The 
programming model is. The purpose of this implementation is to justify the 
programming model, and to provide an opportunity to study the abstractions and 
constructs proposed in the model. It is part of our on-going work to make this 
implementation faster, scalable, secure, and economically acceptable. However, 
the current implementation has shown the feasibility of our programming model. 

Our prototype is an extension to Java 2 Platform, Micro Edition (J2ME)[23]. 
Figure 2 shows the basic structure of the Spatial Views compilation system. 
We are currently investigating optimization passes that improve the chances of 
a successful program execution in a highly volatile target network. The compiled 
bytecode runs on a network, each node of which has a Spatial Views virtual 
machine and a Spatial Views runtime library. Figure 3 shows the architecture of 
a single node. 

We built the Spatial Views compiler, virtual machine, and runtime library 
based on Sun’s J2ME technology [23]. J2ME is a Java runtime environment 
targeting very small commodity devices. KVM [24] is a key part of J2ME. It is a 
virtual machine designed for small-memory, limited-resource, networked devices 
such as cell phones, which typically contain 16- or 32-bit processors and a 
minimum memory of about 128 kilobytes. 

We modified javac in Java 2 SDK 1.3.1 to support the new Spatial Views 
language structures, including the foreach and forany statements and space 
definition statements. We modified KVM 1.0.3 to support transparent process 
migration, and we extended CLDC 1.0.3 with new system classes to support 
Spatial Views language features. We ported our implementation to x86 and ARM 
architectures, running Linux 2.4.x. 

4.1 Spatial Views Iteration and Selection 

At the beginning of an iteration, a new thread is created to discover interesting 
nodes and to migrate the process there. We call the new thread Bus Thread. The 
Bus Thread implements a certain discovery/routing algorithm and respects the 
user-specified constraints. 

The Bus Thread migrates from one interesting node to another. An interesting 
node is a node that provides the service and is located in the space specified 
in the spatial view definition. On such a node, the Bus Thread blocks and 
switches to the user task thread, the code of which is specified in the 
iteration body. When an iteration step finishes, the user task thread blocks 
and switches back to the Bus Thread. The Bus Thread continues until no more 
interesting nodes can be found or the time budget is used up. In the case of 
selectors, the Bus Thread finishes right after the first interesting node is 
found. When the Bus Thread finishes, the corresponding spatial views iteration 
ends. The Bus Thread is like a bus carrying passengers (user task threads in 
our case), running across a region and stopping at certain interesting places, 
hence the name. 

This implementation with a Bus Thread provides a simple framework to 
iterate a spatial view as a dynamic set of interesting nodes. Node discovery is 
transparent to the programmer and performed by the underlying middleware or 
by the OS using existing or customized discovery and routing algorithms. 

Such a framework does not limit the search algorithm a program uses to 
discover an interesting node. In the current implementation, we use the “random 
walk” technique, which randomly picks a neighbor of the current node and 
migrates there. On each node the Bus Thread checks for the service and 
location. If the interesting service is found in the specified space, it 
switches to the user task. The Bus Thread remembers the nodes that it has 
visited by recording their IDs (e.g. IP addresses and port numbers) and avoids 
visiting them again. 
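
The random walk with a visited set can be sketched in a few lines of plain Java. This is a single-process simulation under stated assumptions, not the prototype's Bus Thread: the graph is a hypothetical adjacency list of integer node IDs, "migration" is just moving an index, and backtracking along the current path when all neighbors have been visited is an assumption made here so that the walk can cover a connected network.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

final class RandomWalkSketch {
    // Walks the graph from `start`, each hop picking a random unvisited neighbor.
    // Returns the interesting nodes in the order in which the user task would run on them.
    static List<Integer> walk(int[][] neighbors, boolean[] interesting, int start, long seed) {
        Random rng = new Random(seed);
        Set<Integer> visited = new HashSet<>();   // IDs of nodes already visited
        Deque<Integer> path = new ArrayDeque<>(); // current path, used for backtracking (assumption)
        List<Integer> stops = new ArrayList<>();

        path.push(start);
        visited.add(start);
        if (interesting[start]) {
            stops.add(start);
        }
        while (!path.isEmpty()) {
            int current = path.peek();
            List<Integer> candidates = new ArrayList<>();
            for (int n : neighbors[current]) {
                if (!visited.contains(n)) {
                    candidates.add(n);
                }
            }
            if (candidates.isEmpty()) {
                path.pop();                       // nothing new here: step back
                continue;
            }
            int next = candidates.get(rng.nextInt(candidates.size()));
            visited.add(next);                    // remember the node to avoid revisiting it
            path.push(next);
            if (interesting[next]) {
                stops.add(next);                  // the user task would run here
            }
        }
        return stops;
    }

    // Hypothetical 5-node topology: edges 0-1, 0-2, 1-3, 2-3, 3-4.
    static int[][] sampleGraph() {
        return new int[][] { {1, 2}, {0, 3}, {0, 3}, {1, 2, 4}, {3} };
    }

    public static void main(String[] args) {
        boolean[] interesting = {false, true, false, false, true};
        System.out.println("stops: " + walk(sampleGraph(), interesting, 0, 42L));
    }
}
```

On a static, connected graph every interesting node is reached exactly once, whatever the seed; only the visiting order changes. In a changing network no such guarantee holds.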

Such an algorithm may be slow and not scalable, but one can hardly do better 
in an unstructured, dynamic network. However, if the network is not changing 
very fast or not changing at all, a static directory of services can be maintained to 
find interesting nodes. Another possible improvement is to allow the Bus Thread 
to clone itself and search the network in parallel. This optimization is currently 
under investigation. 

As to the constraints, so far we have implemented the time constraint. The 
Bus Thread times each single iteration step, and checks the remaining time 
budget after each single iteration step finishes. If the budget drops below 
zero, the iteration is stopped. So the time constraint is a soft deadline 
implemented with “best-effort” semantics. This soft deadline provides effective 
trade-offs between quality-of-results and time consumption as shown in section 
5.3. 
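
The budget check described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the class name `TimeBudgetSketch` is hypothetical, and step durations are passed in as numbers instead of being measured with a clock as a real runtime would do.

```java
final class TimeBudgetSketch {
    // Runs iteration steps with the given (simulated) durations in milliseconds
    // under a time budget, and returns how many steps were executed.
    // Each step is atomic: it always runs to completion, and the budget is
    // only checked after the step has finished.
    static int run(long budgetMs, long[] stepDurationsMs) {
        long remaining = budgetMs;
        int completed = 0;
        for (long duration : stepDurationsMs) {
            remaining -= duration;   // the step runs to completion
            completed++;
            if (remaining < 0) {     // budget exhausted: no further steps
                break;
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        // Budget of 1000 ms, four steps of 400 ms each: the third step
        // overruns the budget but still completes; the fourth never starts.
        System.out.println(run(1000, new long[] {400, 400, 400, 400}));
    }
}
```

The sketch makes the soft-deadline behavior visible: total running time may exceed the budget by up to one iteration step, but never by more.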



4.2 Transparent Process Migration 

Transparent process migration is implemented as a native method, migrate, in 
a Spatial Views system class. It is used in the implementation of the foreach 
and forany operations. migrate takes the destination node address as its 
parameter. When migrate is called, the Spatial Views KVM sends the whole heap 
to the destination, as well as the virtual machine status, including the thread 
queue, instruction counter, the execution stack pointer, and other information. 

The KVM running on the destination node receives the heap contents and 
the KVM status and starts a new process. Instead of ordinary process initial- 
ization, the receiving KVM populates its heap with the contents received from 
the network and adjusts its registers and data structures with the KVM status 
received from the network. To make migrate more efficient, we force a garbage 
collection before each migration. 

5 Experiments 

We used 10 Compaq iPAQ PDAs (Models H3700 and H3800) as our test bed, 
2 of which are equipped with camera sleeves developed as part of the Mercury 
project at HP Cambridge Research Laboratory (CRL) (Figure 4(a)). The iPAQs 
were connected via 802.11b wireless technology. Since we had not implemented 
a location service based on GPS or other location technology, all node locations 
were statically configured in these experiments. 



5.1 Application Example 

We implemented the person search application discussed in Section 3.3. We timed 
the execution of the application on 10 iPAQ PDAs connected by an 802.11b 
wireless network. The network topology is shown in Figure 4(b) (see footnote 1). 

Nodes “i” and “j” have cameras, shown as dark gray pentagons in the figure; 
nodes “b”, “c”, and “f” provide the face detection service, shown as light gray 
triangles in the figure. The program starts from node “a” and eventually visits 
all the nodes in the network in depth-first order. Once it finds a node with 
a camera, it takes a picture and checks if there is a red region in the picture. 
If there is, the program will look for a node providing the face detection 
service. It stops on the first node with the service and looks for a face in 
the picture. If a face is found, and it is close to the red region in the 
picture, the program records the location where the picture was taken. Once the 
program has visited all the nodes, it migrates back to the starting node. 

We experimented with two situations. Situation 1: A red region is detected 
on both node “i” and “j”, but a face is found only in the picture from node 
“j”. Situation 2: No red region is detected on either node “i” or “j”, so no face 
detection is triggered. We timed the executions in both situations. The program 
took on average 23.1 seconds in situation 1 and 10.0 seconds in situation 2. In 
both cases, the time constraint was not violated. It is important to note that 
all the iPAQs use SA-1100 StrongARM processors running at 206MHz, but the 
nodes that provide the face detection service offload the face detection 
computation to a PC. The execution time for the first situation was 
dramatically reduced as suggested in [25]. 

1 In this paper, “network topology” refers to the network topology observed by one 
program execution. Another execution is very likely to observe a different topology, 
because the network is changing. 

Fig. 4. (a) Mercury Backpaq 

5.2 One-Hop Migration Time 

To assess the efficiency of execution migration, we measured the one-hop 
migration time. We measured the overall execution time of two consecutive 
migrations, one migrating to a neighbor, followed by another one migrating 
back. The time taken by those two consecutive migrations is the round-trip time 
for one-hop migration, which is twice the migration time. We measured the time 
for different live data sizes. (The heap size is 128KB, but only live data are 
transferred.) The result is shown in Figure 5 for a wired (100Mbps Ethernet) 
and a wireless (11Mbps 802.11b) connection. 

In the KVM heap, there is a permanent space which is not garbage collectible. 
For our test program, the size of the permanent space is 65KB (66560 bytes). 
The contents of the permanent space include Java system classes, which are 
available on all the nodes, and strings, most of which are used only once in 
a program. The current implementation transfers the entire permanent space 
in a migration operation. We are making efforts to avoid this, which we expect 
would significantly speed up migration. 



5.3 Effects of Timeout Constraints 

To evaluate the effects of timeout constraints, we simulate failures of the 
network links with given probabilities. The test program iterates over 
“temperature sensors” and reads the temperatures to calculate the average 
temperature. After finishing on each node, the program tries to connect to a 
neighbor. If none of the neighbors is reachable, the program waits for 10ms and 
tries again. It keeps trying until it successfully migrates to a neighbor. 

Fig. 5. One-Hop Migration Time 

Fig. 6. Topologies for Experiment on Timeout Constraint: (a) and (b) 

If the network link failure probability is high, the iteration time might be 
very long. In that case, the timeout constraints can significantly reduce the 
iteration time and still get some result. We did the experiments with two different 
topologies shown in Figure 6(a) and 6(b), with the experimental results shown 
in Figure 7. 

The time to wait before a successful migration is 
$10\,\mathrm{ms} \times \frac{1}{1-p}$, where $p$ is the probability that all 
the links of a node to its neighbors fail. In Topology (a), $p = p_l$, where 
$p_l$ is the failure probability of a single link. In Topology (b), 
$p = p_l^2$. Then the time of a single iteration step is 
$10\,\mathrm{ms} \times \frac{1}{1-p} + 400\,\mathrm{ms}$, where 400ms is the 
maximum one-hop migration time (see Figure 5). 

If no time constraint is imposed, the expected execution time is 
$(n-1) \times 10\,\mathrm{ms} \times \frac{1}{1-p} + (n-1) \times 400\,\mathrm{ms}$, 
where $n$ is the number of nodes visited. We omit the task execution time on 
each node, because the temperature reading is so fast that the time it takes is 
much less than the migration and waiting time. 

If a timeout $t_{timeout}$ is specified, the expected program execution time 
will be $< t_{timeout} + 10\,\mathrm{ms} \times \frac{1}{1-p} + 400\,\mathrm{ms}$. 
For Topology (a), with link failure probability $p_l = 98\%$ and 
$t_{timeout} = 2200\,\mathrm{ms}$, that upper bound is 3100ms, which agrees 
with the experimental result, 3095ms (see Figure 7(a)). For Topology (b), with 
$p_l = 98\%$ and $t_{timeout} = 1200\,\mathrm{ms}$, that upper bound is 1850ms, 
which also agrees with the experimental result, 1802ms (see Figure 7(b)). 

Fig. 7. Effects of Timeout: (a) Iteration Time on Topology (a); (b) Iteration 
Time on Topology (b); (c) Number of Nodes Visited with Timeout Constraint 
(Topology (a)); (d) Number of Nodes Visited with Timeout Constraint (Topology 
(b)). Horizontal axis: Network Link Failure Probability (%). 
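
As a check on the two upper bounds quoted in the text, the arithmetic can be worked out explicitly. The only assumption is that the expected number of 10ms attempts before a successful migration is $\frac{1}{1-p}$, the mean of a geometric distribution with success probability $1-p$, with $p = p_l$ for Topology (a) and $p = p_l^2$ for Topology (b).

$$
\begin{aligned}
\text{Topology (a): } & p = 0.98, \quad \frac{10\,\mathrm{ms}}{1 - 0.98} = 500\,\mathrm{ms}, \\
& 2200\,\mathrm{ms} + 500\,\mathrm{ms} + 400\,\mathrm{ms} = 3100\,\mathrm{ms}; \\
\text{Topology (b): } & p = 0.98^2 = 0.9604, \quad \frac{10\,\mathrm{ms}}{1 - 0.9604} \approx 252.5\,\mathrm{ms}, \\
& 1200\,\mathrm{ms} + 252.5\,\mathrm{ms} + 400\,\mathrm{ms} \approx 1852.5\,\mathrm{ms} \approx 1850\,\mathrm{ms}.
\end{aligned}
$$

Both values match the bounds reported in the text (3100ms and 1850ms).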



Using time constraints, a programmer is able to keep a decent quality of 
result of the program, while significantly reducing the execution time. Instead of 
producing no answer (as happens when a user presses “Ctrl-C” in a traditional 
programming environment), the program reports a result of reduced quality (e.g. 
only two temperature readings) when the time budget is used up. The number 
of nodes visited in our experiments, as the criterion for quality of result, is shown 
in Figure 7(c) and 7(d). 

6 Conclusion 

Spatial Views is a programming model that allows the specification of programs 
to be executed on dynamic and resource-limited networks of embedded systems. 
In such environments, the physical location of nodes is crucial. Spatial Views 
allows a user to specify a virtual network based on common node characteristics 
and location. Nodes in such a virtual network can be visited using an iterator or 
selector. Execution migration, node discovery, and routing are done transparently. 
Time and other resource constraints allow the programmer to express quality of 
result trade-offs and to manage the inherent volatility of the underlying network. 

The Spatial Views programming model is simple and expressive. A prototype 
of Spatial Views, including a compiler, a runtime library, and a virtual 
machine, has been implemented as an extension to J2ME. Experimental results on 
a network of up to 10 iPAQ handheld computers running Linux are very 
encouraging for a person search application. In addition, the effectiveness of 
time constraints in allowing graceful degradation of the quality of a program’s 
answer was experimentally evaluated for a temperature sensor network with two 
different network topologies. 

Spatial Views is the first spatial programming model with best-effort 
semantics. The model allows optimizations such as parallelization (multiple 
threads) and trade-offs between quality of result and resource usage. 

Abstract. Compilers have long used redundancy removal to improve 
program execution speed. For handheld devices, redundancy removal is 
particularly attractive because it improves execution speed and energy 
efficiency at the same time. In a broad view, redundancy exists in many 
different forms, e.g., redundant computations and redundant branches. 
We briefly describe our recent efforts to expand the scope of redundancy 
removal. We attain computation reuse by replacing a code segment by 
a table look-up. We use IF-merging to merge conditional statements 
into a single conditional statement. We present part of our preliminary 
experimental results from an HP/Compaq iPAQ PDA. 



1 Introduction 

Compilers have long used redundancy removal to improve program execution 
speed. For handheld devices, which have limited energy resources, redundancy 
removal is particularly attractive because it improves execution speed and en- 
ergy efficiency at the same time. In a broad sense, any reuse of a previous result 
can be viewed as a form of redundancy removal. Recently, our research group 
has investigated methods to expand the scope of redundancy removal. The in- 
vestigation has resulted in two forms of operation reuse, namely computation 
reuse and branch reuse. 

Computation reuse can be viewed as an extension of common subexpression 
elimination (CSE). CSE looks for redundancy among expressions in different 
places of the program. Each such expression computes a single value. In 
contrast, computation reuse looks for redundancy among different instances of 
a code segment or several code segments which perform the same sequence of 
operations. In this paper, we shall discuss computation reuse for a single code 
segment which exploits value locality [1, 2, 3, 4] via pure software means. 

We exploit branch reuse through an IF-merging technique which reduces the 
number of conditional branches executed at run time. This technique does not 
require special hardware support and thus, unlike hardware techniques, it does 
not increase the power rate. The merger candidates include IF statements that 
have identical or similar IF conditions but are separated by other 
statements. The idea of IF-merging can be implemented with various degrees of 
aggressiveness: the basic scheme, a more aggressive scheme that allows 
nonidentical IF conditions, and lastly, a scheme based on path profiling 
information. In the next two sections, we discuss these techniques in turn and 
compare each technique with related work. We conclude in the last section. 

2 Computation Reuse 

Recent research has shown that programs often exhibit value locality [1, 2, 3, 4], 
a phenomenon in which a small number of values appear repeatedly in the same 
register or the same memory location. A number of hardware techniques [5, 6, 7, 
1, 2, 8, 9, 4] have been proposed to exploit value locality by recording the inputs 
and outputs of a code segment in a reuse table implemented in the hardware. 
The code segment can be as short as a single instruction. A subsequent instance 
of the code segment can be simplified to a table look-up if the input has appeared 
before. 

The hardware techniques require a nontrivial change to the processor design, 
typically by adding a special buffer which may contain one to sixteen entries. 
Each entry records an input (which may consist of several different variables) 
and its matching output. Such a special buffer increases the hardware design 
complexity and the hardware cost, and it remains unclear whether the cost is 
justified for embedded systems and handheld computing devices. Using a soft- 
ware scheme, the table size can be much more flexible, although table look-up 
will take more time. The benefit and the overhead must be weighed carefully. 

In our scheme, we use a series of filtering steps to identify stateless code 
segments which are good candidates for computation reuse. Figure 1 shows the 
main steps of our compiler scheme. For each selected code segment, the scheme 
creates a hash table to continuously record the inputs and the matching outputs 
of the code segment. Based on factors such as value repetition rate, 
computation granularity estimation, and hashing complexity, we develop a 
formula to estimate whether the table look-up will cost less than repeating the 
execution. The hashing complexity depends on the hash function and the 
input/output size. The hash table can be as large as the number of different 
input patterns. This offers opportunities to reuse computation whose inputs and 
outputs do not fit in a special hardware buffer. 

2.1 How to Reuse 

Computation reuse is applied to a stateless code segment whose output 
depends entirely on its input variables, i.e. variables and array elements 
which have upwardly-exposed reads in the segment. The output variables are 
identified by liveness analysis. A variable computed by the code segment is an 
output variable if it remains live at the exit of the code segment. If we 
create a look-up hash table for the code segment, the input variables will form 
the hash key. An invariant never needs to be included in the hash key. 
Therefore, for convenience, we exclude invariants from the set of input 
variables. 

Fig. 1. Framework of the compiler scheme 



The code segment shown in Figure 2(a) has an input variable val which is 
upwardly exposed to the entry of function quan. The array power2 is assumed 
to be invariant. The output variable is integer i which remains live at the exit 
of the function. 

Our scheme collects information on three factors which determine the
performance gain or loss from computation reuse, namely the computation
granularity, the hashing overhead, and the input reuse rate of the given code
segment. With the execution-frequency profiling information, it is relatively
easy to estimate the computation granularity, defined as the number of
operations performed by the code segment. To get the reuse rate, we estimate
the number ($N_{ds}$) of distinct sets of input values by value profiling and
the number ($N$) of times the code segment is executed. We define the reuse
rate $p$ by the following equation:

$$p = 1 - \frac{N_{ds}}{N}$$


Based on the inputs and the outputs of the candidate code segment, we
estimate the overhead of the hash table for computation reuse. The hashing
overhead depends mainly on the complexity of the hash function and the size of
each set of inputs and outputs.

To produce a hash key for each code segment, we first define an order among
the input variables. The bit pattern of each input value forms a part of the
key. In the case of multiple input values, the key is composed by concatenating
multiple bit strings. In common cases, the hash key can be quite simple. For
example, the input of the code segment in Figure 2(a) is an integer scalar, so
the hash key is simply the value of the input. The hash index can simply be the
hash key modulo the hash table size. Figure 2(b) shows the transformation
result of the code segment in Figure 2(a).

    int quan( int val ) {
        int i;

        for ( i = 0; i < 15; i++ ) {
            if ( val < power2[i] )
                break;
        }
        return (i);
    }

    (a)

    int quan( int val ) {
        int i, key;

        if ( check_hash(val, hash_table, &key) == 0 ) {
            for ( i = 0; i < 15; i++ ) {
                if ( val < power2[i] )
                    break;
            }
            hash_table[key] = i;
        }
        else {
            i = hash_table[key];
        }
        return (i);
    }

    (b)

Fig. 2. An example code segment and its transformation by applying computation
reuse

The hashing overhead depends on the size of the input and the output. The 
time to determine whether we have a hit is proportional to the size of the input. 
For a hit, the recorded output values should be copied to the corresponding 
output variables. For a miss, the computed output values must be recorded in 
the hashing table. In both cases, the cost of copying is proportional to the size of 
the output. In our scheme, we count the numbers of extra operations performed 
during a hit or a miss. (Note that a hit or a miss has the same number of extra 
operations.) A hashing collision can increase the hashing overhead. However, we 
assume there exist no hashing collisions. 
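
The text does not spell out how check_hash in Figure 2(b) decides between a hit
and a miss. The following is a minimal sketch under stated assumptions, not the
authors' implementation: it assumes a direct-mapped table whose index is the key
modulo the table size, plus a separate tag array that remembers which input
currently owns each slot. The names HASH_SIZE, hash_tag, hash_valid and
quan_original, and the contents of power2, are hypothetical.

```c
#include <stdint.h>

#define HASH_SIZE 1024u   /* assumed table size, not taken from the paper */

/* Assumed contents of the invariant array power2 (powers of two). */
static const int power2[15] = {1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                               1024, 2048, 4096, 8192, 16384};

static int     hash_table[HASH_SIZE];   /* recorded outputs, as in Fig. 2(b) */
static int     hash_tag[HASH_SIZE];     /* hypothetical: input owning a slot */
static uint8_t hash_valid[HASH_SIZE];   /* hypothetical: slot in use?        */

/* Returns 1 on a hit and 0 on a miss. *key receives the slot index either
   way, so the caller can read or write hash_table[*key]. On a miss the slot
   is claimed for val; the caller must then store the computed output. */
int check_hash(int val, const int *table, int *key)
{
    unsigned idx = (unsigned)val % HASH_SIZE;   /* hash key modulo table size */

    (void)table;              /* outputs are accessed by the caller */
    *key = (int)idx;
    if (hash_valid[idx] && hash_tag[idx] == val)
        return 1;
    hash_tag[idx] = val;      /* a colliding input simply replaces the entry */
    hash_valid[idx] = 1;
    return 0;
}

/* Figure 2(a): the original computation. */
int quan_original(int val)
{
    int i;
    for (i = 0; i < 15; i++) {
        if (val < power2[i])
            break;
    }
    return i;
}

/* Figure 2(b): the same function with computation reuse. */
int quan(int val)
{
    int i, key;

    if (check_hash(val, hash_table, &key) == 0) {
        i = quan_original(val);
        hash_table[key] = i;
    } else {
        i = hash_table[key];
    }
    return i;
}
```

Comparing the tag on every look-up is what makes the hit test proportional to
the size of the input, as described above. Unlike the analysis in the text, this
sketch tolerates collisions by overwriting the older entry.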

2.2 Cost-Benefit Analysis 

For a specific code segment, suppose we know the computation granularity C, 
the hashing overhead O, and the reuse rate p. The cost of computation before 
transformation equals C . The new cost of computation with computation reuse 
is specified by formula (1) below. Our scheme checks to see whether the gain by 
applying computation reuse, defined by formula (2), is positive or negative. 

$$(C + O)\cdot(1 - p) + O\cdot p \qquad (1)$$

$$C - [(C + O)\cdot(1 - p) + O\cdot p] = p\cdot C - O \qquad (2)$$

$$p\cdot C - O > 0 \quad \text{or} \quad p > \frac{O}{C} \qquad (3)$$

In the above, computation reuse improves performance for the specific code
segment if and only if the condition in formula (3) is satisfied. Obviously, the
reuse rate p can never be greater than 1. This gives us another criterion to
filter out code segments so as to reduce the complexity of value-set profiling.
The compiler scheme removes code segments which do not satisfy O/C < 1 from
further consideration. For the remaining code segments, value profiling is
performed to get p.

After we obtain p, the compiler picks the code segments which satisfy formula 
(3) for computation reuse. Such code segments are transformed into codes that 
perform table look-up. 
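
The two-stage decision in this section can be sketched in a few lines of C. This
is only an illustration of the cost model: C is the computation granularity, O
the hashing overhead, and p the reuse rate, and the function names are
hypothetical.

```c
/* Expected cost per execution after the transformation: a miss costs C + O,
   a hit costs O. */
double reuse_cost(double C, double O, double p)
{
    return (C + O) * (1.0 - p) + O * p;
}

/* Gain per execution; algebraically equal to p * C - O. */
double reuse_gain(double C, double O, double p)
{
    return C - reuse_cost(C, O, p);
}

/* Cheap pre-filter applied before value-set profiling: since p <= 1, a
   segment with O / C >= 1 can never profit. Assumes C > 0. */
int passes_granularity_filter(double C, double O)
{
    return O / C < 1.0;
}

/* Final test once p is known from profiling. */
int should_reuse(double C, double O, double p)
{
    return p * C - O > 0.0;
}
```

For instance, with C = 100, O = 10 and p = 0.5 the gain is 40 operations per
execution, whereas raising O to 60 makes the gain negative.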

2.3 Value-Set Profiling 

Our scheme requires information on the reuse rate p, which measures the
repetitiveness of a set of input values for a code segment. This is in contrast
to single-variable value profiling [10], where one can record the number of
different values of the variable written by an instruction during the program
execution. The ratio of this number over the total number of executions of the
instruction defines the value locality at the instruction. (The lower the ratio,
the higher the locality.) The locality of a set of values, unfortunately, cannot
be directly derived from the locality of the member values. For example, suppose
x and y each have two distinct values. The set (x, y) may have two, three, or
four distinct value combinations.

Therefore, our scheme first needs to define the code segments for which we
conduct value-set profiling. Given such a code segment, profiling code stubs can
be inserted to record its distinct sets of input values. If we indiscriminately
perform such value-set profiling for all possible code segments, the profiling
cost will be prohibitive. To limit such cost, we confine the code segments of
interest to frequently executed routines, loops and IF branches. Such frequency
information is available from well-known tools such as gprof and gcov.
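
A profiling stub of this kind can be sketched as follows. This is a simplified
illustration, not the authors' profiler: it assumes a segment with two integer
inputs, stores the distinct input sets in a fixed-size array with a linear
search, and takes the reuse rate to be 1 - N_ds / N. The names profile,
profile_record, reuse_rate and MAX_SETS are hypothetical.

```c
#include <stddef.h>

#define MAX_SETS 4096   /* assumed capacity of the profile */

typedef struct { int x, y; } input_set;   /* one set of input values */

typedef struct {
    input_set seen[MAX_SETS];   /* distinct sets observed so far */
    size_t    n_distinct;       /* N_ds */
    size_t    n_total;          /* N    */
} profile;

/* Called at the entry of the profiled code segment. */
void profile_record(profile *pf, int x, int y)
{
    size_t i;

    pf->n_total++;
    for (i = 0; i < pf->n_distinct; i++) {
        if (pf->seen[i].x == x && pf->seen[i].y == y)
            return;                          /* this set was seen before */
    }
    if (pf->n_distinct < MAX_SETS) {         /* beyond capacity, N_ds is */
        pf->seen[pf->n_distinct].x = x;      /* undercounted             */
        pf->seen[pf->n_distinct].y = y;
        pf->n_distinct++;
    }
}

/* Reuse rate under the assumption p = 1 - N_ds / N. */
double reuse_rate(const profile *pf)
{
    if (pf->n_total == 0)
        return 0.0;
    return 1.0 - (double)pf->n_distinct / (double)pf->n_total;
}
```

Recording the sets (0,0), (0,1), (1,0) and (0,0) again yields three distinct
sets out of four executions, even though x and y each take only two values.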

2.4 Experimental Results 

We use Compaq’s iPAQ 3650 for the experiments. The iPAQ 3650 has a 206 MHz
Intel StrongARM SA-1110 processor [11] and 32 MB of RAM, with a 16 KB
instruction cache and an 8 KB data cache, both 32-way set-associative. To test
the energy consumption on the handheld device, we connect an HP 3458A
high-precision digital multimeter to measure the actual current drawn by the
handheld computer during the program execution.

We have experimented with six multimedia programs from Mediabench [12] 
and the GNU Go game. In our experiments, we use the default input parameters 
and input files as specified on the Mediabench web-site. The results from these 
programs are described below. 

The two programs, G721_encode and G721_decode perform voice compression 
and decompression, respectively, based on the G.721 standard. They both call a 
function quan which has a computation reuse rate of over 99%.

Table 1. Performance Improvement by Computation Reuse

Programs        Original (s)   Computation Reuse (s)   Speedup
G721_encode             2.01                    1.53      1.31
G721_decode             3.69                    2.76      1.34
MPEG2_encode          120.63                  113.30      1.06
MPEG2_decode           83.02                   46.06      1.80
RASTA                  14.92                   12.66      1.18
UNEPIC                  1.73                    0.76      2.28
GNUGO                 788.05                  654.51      1.20
Harmonic Mean                                             1.37


The programs MPEG2_encode and MPEG2_decode encode and decode, re- 
spectively, MPEG data. Our scheme identifies the function fdct for computation 
reuse in MPEG2_encode and the function ReferenceADCT in MPEG2_decode. 

RASTA , which implements front-end algorithms of speech recognition, is a 
program for the rasta-plp processing. Its most time-consuming function FRATR 
contains a code segment with one input variable and six output variables. The 
input repetition rate is 99.6%. 

UNEPIC is an image decompression program. Its main function contains 
a loop to which our compiler scheme is applied. The loop body has a single input 
variable and a single output variable, both integers. The input has a repetition 
rate of 65.1%. 

GNU Go is a go game. In our experiments, we use the input parameters “-b 6
-r 2”, where “-b 6” means playing 6 steps in benchmark mode and “-r 2” means
setting the random seed to 2 (to make it easier to verify results). The function
accumulate_influence contains eight code segments for computation reuse and
the average repetition rate of inputs is 98.2%.

Tables 1 and 2 compare the performance and energy consumption,
respectively, before and after the transformation. The machine codes (both
before and after our transformations) are generated by the GCC compiler (pocket
Linux version) with the most aggressive optimizations (-O3). The energy is
measured in Joules (J).
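
The derived columns of Tables 1 and 2 can be recomputed from the measured
values. The helpers below are a small sketch with hypothetical function names:
speedup is the ratio of original to transformed time, energy saving is the
relative reduction in Joules, and the summary row is the harmonic mean of the
per-program speedups.

```c
#include <stddef.h>

/* Speedup = original time / transformed time. */
double speedup(double original_s, double reuse_s)
{
    return original_s / reuse_s;
}

/* Energy saving as a fraction of the original energy. */
double energy_saving(double original_j, double reuse_j)
{
    return (original_j - reuse_j) / original_j;
}

/* Harmonic mean of n positive values. */
double harmonic_mean(const double *x, size_t n)
{
    double sum_inv = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
        sum_inv += 1.0 / x[i];
    return (double)n / sum_inv;
}
```

Applied to the seven speedups in Table 1, harmonic_mean gives a value that
rounds to the reported 1.37; for G721_encode, speedup(2.01, 1.53) rounds to 1.31
and energy_saving(4.59, 3.56) to 22.4%.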



Table 2. Energy Saving by Computation Reuse

Programs        Original (J)   Computation Reuse (J)   Energy Saving
G721_encode             4.59                    3.56           22.4%
G721_decode             8.43                    6.47           23.3%
MPEG2_encode          281.67                  265.12            5.9%
MPEG2_decode          193.85                  108.01           44.3%
RASTA                  36.60                   31.02           15.2%
UNEPIC                  4.03                    1.81           55.1%
GNUGO                1936.23                 1613.69           16.7%

Table 3. Performance Improvement for Different Input Files

Programs        Sources of Inputs             Original (s)   Computation Reuse (s)   Speedup
G721_encode     MiBench                               9.12                    6.77      1.35
G721_decode     MiBench                               8.60                    6.32      1.36
MPEG2_encode    Tektronix (table tennis)            175.36                  147.47      1.19
MPEG2_decode    Tektronix (table tennis)            139.32                   94.37      1.48
RASTA           ICSI (rasta_testsuite_1998)          37.87                   31.98      1.18
UNEPIC          EPIC web-site (baboon.tif)            7.26                    1.71      4.25
GNUGO           “-b 9 -r 2”                        1485.28                 1236.96      1.20
Harmonic Mean                                                                           1.43


Since our computation reuse scheme is based on profiling, we test the
effectiveness of the scheme with different input files. The program
transformation is based on profiling with the default input files from the
Mediabench web-site, and we run the transformed programs with other input
files. We show the results in Table 3. GNU Go has no input files, so we change
the parameter from 6-step to 9-step. For each of the other programs, we
arbitrarily collect one input file from the Internet or from another benchmark
suite such as MiBench [13]. We list the sources of the input files in the second
column of Table 3. For G721, we choose the input file small.pcm from the MiBench
program ADPCM. We select tens_015.m2v, which shows a table tennis game, from the
Tektronix web-site, and extract the first 6 frames as the input of MPEG2 encode
and decode. For RASTA, we choose the input file phone.pcmbe.wav in the 1998
RASTA test suite from ICSI. For UNEPIC, we get the input file baboon.tif of
EPIC, and we generate the UNEPIC input file by running EPIC with baboon.tif as
input. The last column of Table 3 shows the effectiveness of our scheme.
Although the transformation is based on profiling information from the default
input files, the transformed programs still achieve substantial performance
improvement for other input files.

2.5 Related Work 

Since Michie introduced the concept of memoization [8], the idea of computation
reuse was used mainly in the context of declarative languages until the early
1990s. In the past decade, many researchers have applied this concept to reuse
the intermediate computation results of previously executed instructions [5, 6,
7, 1, 2, 9, 4]. Richardson applies computation reuse to two applications by
recording the previous computation results in a result cache [9]. However, he
does not specify how the technique was implemented, and the result cache in his
paper is a special hardware cache. Sodani and Sohi [4] propose an instruction
reuse method. The performance improvement of instruction level reuse is not
significant, due to the small reuse granularity [14]. In the block and sub-block
reuse schemes [1, 2], hardware mechanisms are proposed to exploit computation
reuse in a basic block or sub-block. The reuse granularity at the basic block
level still seems too small, and the hardware needs to handle a large number of
basic blocks for computation reuse.

Connors and Hwu propose a hybrid technique [7] which combines software 
and hardware for reusing the intermediate computation results of code regions. 
The compiler identifies the candidate code segments with value profiling. During 
execution, the computation results of these reusable code regions are recorded 
into hardware buffers for potential reuse. Their compiler analysis can identify 
large reuse code regions and feed the analysis results to the hardware through 
an extended instruction set architecture. In the design of the hardware buffer, 
they limit the buffer size to 8 entries for each code segment. 

3 IF-Merging 

Modern microprocessors use deep instruction pipelining to increase the number 
of processed instructions per clock cycle. Branch instructions, however, tend to 
degrade the efficiency of deep pipelining. Further, conditional branches reduce 
the size of basic blocks, introduce control dependences between instructions, 
and hence may hamper the compiler’s ability to perform code improvement 
techniques such as redundancy removal, software pipelining, and so on [15, 16,
17].

To reduce the penalty due to branch instructions, researchers have proposed
many techniques, including static and dynamic branch prediction [18, 19],
predicated execution [20, 21], branch reordering [17], branch alignment [22],
and branch elimination [15, 16, 23]. Among these, branch prediction, especially
dynamic
branch prediction, has been extensively studied and widely used in modern high- 
performance microarchitectures. Branch prediction predicts the outcome of the 
branch in advance so that the instruction at the target address can be fetched 
without delay. However, if the prediction is incorrect, the instructions fetched 
after the branch have to be squashed. This situation results in a waste of CPU 
cycles and power consumption. Hence, a high prediction rate is critical to the 
performance of high-performance microprocessors. To achieve a high prediction 
rate, almost all high-performance microprocessors today employ some form of 
hardware support for dynamic branch prediction. 

In contrast, processors designed for power-aware systems, such as mobile 
wireless computing and embedded systems, must take both the program speed 
and the power consumption into consideration. The concern for the latter may 
often be greater than for the former on many platforms. A branch predictor 
dissipates a non-trivial amount of power, which can be 10% or more of the
processor’s total power dissipation. Such a predictor, therefore, may not be
found on microprocessors that have more stringent power constraints [24].

Hardware support for predicated execution [25] of instructions has been used 
on certain microprocessors, such as Intel XScale. Predicated execution removes 
forward conditional branches by attaching flags to instructions. The instructions 
are always fetched and decoded. But if the predicate evaluates to false, then 
a predicated instruction does not commit. Obviously, the effectiveness of
predicated execution depends highly on the rate at which the predicates evaluate
to true. If the rate is low, then the waste in CPU cycles and power can be
rather high.

    if ( sign ) {
        diff = -diff;
    }
    /* intermediate statements */
    if ( sign )
        valpred -= vpdiff;
    else
        valpred += vpdiff;

    (a)

    if ( sign ) {
        diff = -diff;
        /* intermediate statements */
        valpred -= vpdiff;
    }
    else {
        /* intermediate statements */
        valpred += vpdiff;
    }

    (b)

Fig. 3. An example code showing an opportunity for basic IF-merging

It is also worth noting that branch prediction, as a run-time technique, gen- 
erally does not help enhance the compiler’s ability to improve codes. Recently 
proposed speculative load/store instructions expose the control of speculative ex- 
ecution to the software, which may increase the compiler’s ability to pursue more 
aggressive code improvement techniques [26]. However, with today’s technology,
hardware support for speculative execution tends to increase power consumption
considerably. Therefore, such support is not available on microprocessors which
have more stringent power constraints.

In order to reduce the number of conditional branches executed at run time, 
we perform a source-level program transformation called IF-merging. This tech- 
nique does not require special hardware support and it does not increase the 
power rate. Using this technique, the compiler identifies IF statements which 
can be merged to increase the size of the basic blocks, such that more instruc- 
tion level parallelism (ILP) may be exposed to the compiler backend and, at run 
time, fewer branch instructions are executed. The merger candidates include IF
statements that have identical or similar IF conditions but are separated by
other statements. Programmers usually leave them as separate IF statements to
make the program more readable.

The idea of IF-merging can be implemented with various degrees of aggres- 
siveness: the basic scheme, a more aggressive scheme to allow nonidentical IF 
conditions, and lastly, a scheme based on path profiling information. 

3.1 A Basic IF-Merging Scheme 

In the basic scheme, we merge IF statements with identical IF conditions to
reduce the number of branches and condition comparisons. Figure 3(a) shows
an example extracted from the Mediabench suite. In the example code, two IF
statements with identical conditions are separated by other statements, which
we call intermediate statements. Based on the data dependence information, we
find that such intermediate statements have data dependences with the two merger
candidates. Hence, we cannot move any of these intermediate statements before
or after the new IF statement. Instead, we duplicate the intermediate statements
and place one copy in the then-component of the merged IF statement, and another
in the else-component. Figure 3(b) shows the transformed code obtained by
applying IF-merging to the code in Figure 3(a).

Throughout this section, we assume the source program is structured. Thus 
we can view the function body as a tree of code segments, such that each node 
may represent a loop, a compound IF statement, a then-component, an else- 
component, or simply a block of assignment statements and function calls. The 
function body is the root of the tree. If a node A is the parent of another 
node B , then the code segment represented by B is nested in the code segment 
represented by A. Unless stated otherwise, the merger candidates must always
share a common parent in such a tree.

Obviously, in all of our IF-merging schemes, we need to be able to identify 
identical IF conditions, which requires symbolic analysis of IF conditions. To fa- 
cilitate such analysis, we perform alias analysis [27], global value numbering [28], 
and transform the program into static single assignment (SSA) form [29], such 
that variables with identical values can be clearly identified. We then apply 
a set of normalization rules to the predicate trees of IF conditions, including 
the sub-trees that represent the arithmetic expressions in those conditions. Such 
normalization rules and the ensuing symbolic comparisons have been discussed 
extensively in the literature of software engineering and parallelizing compilers. 



Original Code

    if (tmp1 == -32768 && tmp2 == -32768)
        tmp2 = 32767;
    else
        tmp2 = 0x0FFFF & ((tmp1 * tmp2 + 16384) >> 15);
    if (tmp1 == -32768 && sri == -32768)
        tmp1 = 32767;
    else
        tmp1 = 0x0FFFF & ((tmp1 * sri + 16384) >> 15);

Transformed Code

    if (tmp1 == -32768) {
        if (tmp2 == -32768)
            tmp2 = 32767;
        else
            tmp2 = 0x0FFFF & ((tmp1 * tmp2 + 16384) >> 15);
        if (sri == -32768)
            tmp1 = 32767;
        else
            tmp1 = 0x0FFFF & ((tmp1 * sri + 16384) >> 15);
    }
    else {
        tmp2 = 0x0FFFF & ((tmp1 * tmp2 + 16384) >> 15);
        tmp1 = 0x0FFFF & ((tmp1 * sri + 16384) >> 15);
    }

Fig. 4. Nonidentical conditions with common sub-predicates and its transformation

Original Code

    if ( a ) {
        S1;
    }
    if ( b ) {
        S2;
    }

Transformed Code

    if ( a && b ) {
        S1;
        S2;
    }
    else if ( a )
        S1;
    else if ( b )
        S2;

Fig. 5. Example code shows IF-merging with profiling



3.2 IF-Condition Factoring 

The basic IF-merging scheme only identifies IF statements with identical IF con- 
ditions for IF-merging. Suppose the conditions are nonidentical but have com- 
mon sub-predicates. By factoring the conditions we can also reduce the number 
of branches. The left-hand side of Figure 4 shows an example code extracted 
from Mediabench, and the right-hand side of Figure 4 shows the transformed 
code. 

Our factoring scheme identifies IF statements with conditions containing 
common sub-predicates, and it factors the common sub-predicates from the con- 
ditions to construct a common IF statement, which encloses the original IF 
statements with the remaining sub-predicates as conditions. 
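
The factoring in Figure 4 can be checked mechanically. The sketch below wraps
both versions in functions and lets a caller compare their results; the struct
pair_result, the helper scale, and the function names are hypothetical, and the
inputs are assumed to be 16-bit values held in int so the product cannot
overflow. Right-shifting a negative int is implementation-defined in C, but both
versions evaluate the same expression, so the comparison is unaffected.

```c
typedef struct { int tmp1, tmp2; } pair_result;

/* The common arithmetic of both versions in Figure 4. */
static int scale(int a, int b)
{
    return 0x0FFFF & ((a * b + 16384) >> 15);
}

/* Left-hand side of Figure 4: two IF statements sharing a sub-predicate. */
pair_result original(int tmp1, int tmp2, int sri)
{
    pair_result r;

    if (tmp1 == -32768 && tmp2 == -32768)
        tmp2 = 32767;
    else
        tmp2 = scale(tmp1, tmp2);
    if (tmp1 == -32768 && sri == -32768)
        tmp1 = 32767;
    else
        tmp1 = scale(tmp1, sri);
    r.tmp1 = tmp1;
    r.tmp2 = tmp2;
    return r;
}

/* Right-hand side of Figure 4: the common test tmp1 == -32768 is factored
   out, so the frequent case needs one comparison instead of two. */
pair_result factored(int tmp1, int tmp2, int sri)
{
    pair_result r;

    if (tmp1 == -32768) {
        if (tmp2 == -32768)
            tmp2 = 32767;
        else
            tmp2 = scale(tmp1, tmp2);
        if (sri == -32768)
            tmp1 = 32767;
        else
            tmp1 = scale(tmp1, sri);
    } else {
        tmp2 = scale(tmp1, tmp2);
        tmp1 = scale(tmp1, sri);
    }
    r.tmp1 = tmp1;
    r.tmp2 = tmp2;
    return r;
}
```

The factoring is valid here because the first IF statement does not modify
tmp1, so the common sub-predicate has the same value at both tests.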

3.3 IF-Merging with Path Profiling 

With path profiling information [30], we can make the IF-merging technique
even more aggressive. For example, in the case of the code in the left-hand side
of Figure 5, if the path profiling shows that the majority of executions go to
both S1 and S2, then we can transform the code into that shown in the right-hand
side of Figure 5.

We note that the probability that both IF statements are taken is $p_{ab}$.
If $p_{ab}$ is greater than 0.5, merging the two IF statements will reduce the
number of branches. (The original code has two branches and the merged code has
$1 + 2(1 - p_{ab}) < 2$ branches.) The number of comparison operations (denoted
by $A$) in the transformed code is defined by Formula (4) below, where $p_a$ is
the probability that the first IF statement is taken.

$$A = (1 + p_a) + (1 - p_{ab})\,(1 + (1 - p_a)) \qquad (4)$$

Expanding Formula (4) and using $p_a \ge p_{ab}$ gives a lower bound:

$$A = 3 - 2p_{ab} + p_a p_{ab} \;\Rightarrow\; A \ge 3 - 2p_{ab} + p_{ab}^2 \;\Rightarrow\; A \ge 2 + (1 - p_{ab})^2 \;\Rightarrow\; A \ge 2$$

Using $p_{ab} > 0.5$ and $p_a \le 1$ gives an upper bound:

$$A = 3 - 2p_{ab} + p_a p_{ab} \;\Rightarrow\; A = 3 - p_{ab}(2 - p_a) \;\Rightarrow\; A < 3 - 0.5(2 - p_a) = 2 + 0.5p_a \;\Rightarrow\; A < 2.5$$

Hence, the number of comparison operations in the transformed code ranges
from 2 to 2.5 when $p_{ab}$ is greater than 0.5. The original code has two comparison




Yonghua Ding and Zhiyuan Li 



Original code

    if ( a ) {
        if ( b ) {
            S1;
        }
        else {
            S2;
        }
    }
    else {
        S3;
    }

Transformed code

    if ( b ) {
        if ( a ) {
            S1;
        }
        else {
            S3;
        }
    }
    else { /* !b => a */
        S2;
    }

Fig. 6. Nested IF statements and transformation of IF-exchanging



operations. Although the number of condition comparisons is increased after 
merging, the performance has a net gain. Further, the then-component of the 
merged IF statement may present more opportunities for other optimizations. 
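
The expected counts for the merged code of Figure 5 can be evaluated directly.
The following sketch uses hypothetical function names; p_a is the probability
that condition a holds and p_ab the probability that both a and b hold, so
p_a >= p_ab. It assumes short-circuit evaluation of a && b, which is why the
first test costs 1 + p_a comparisons.

```c
/* Expected branches in the merged code: the merged test always executes,
   and the two else-if tests execute only when a && b fails. */
double merged_branches(double p_ab)
{
    return 1.0 + 2.0 * (1.0 - p_ab);
}

/* Expected comparisons in the merged code:
   (a && b) costs 1 + p_a, "else if (a)" costs 1, and "else if (b)" is
   reached only when a is false, costing 1 - p_a. */
double merged_comparisons(double p_a, double p_ab)
{
    return (1.0 + p_a) + (1.0 - p_ab) * (1.0 + (1.0 - p_a));
}
```

For example, with p_a = 0.9 and p_ab = 0.8 the merged code executes 1.4
branches and 2.12 comparisons on average, against 2 and 2 in the original code.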

Another case for consideration is nested IF statements whose conditions are
dependent. For example, the condition (or its negation) of the inner IF statement
may imply the condition of the outer IF statement. (Obviously, the opposite is
normally false. Otherwise we can remove the inner IF statement.) Given such
nested IF statements, with profiling information on the taken probability, we can
decide whether it is beneficial to exchange the nesting. Figure 6 shows an
example code of nested IF statements in the left-hand side, and the code after
the IF-exchange transformation in the right-hand side. In this example, we
suppose the condition !b (the negation of b) implies the condition a. (For
example, suppose b is X > 0 and a is X < 100.) We further suppose that, based on
profiling information, the taken probability of the outer IF statement ($p_a$)
is greater than that of the inner IF statement ($p_b$). In the original code,
both the number of branches and the number of comparisons are $1 + p_a$, and in
the transformed code, both of them are $1 + p_b$. Since $p_a$ is greater than
$p_b$, the IF-exchange will reduce both the number of branches and the number of
comparisons.
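
The IF-exchange can be illustrated with the conditions used in the example
above, b being x > 0 and a being x < 100, so that !b implies a. In this sketch
the function names are hypothetical and the return values 1, 2 and 3 stand for
the statements S1, S2 and S3.

```c
/* Original nesting: the outer test is a (x < 100), the inner test is b. */
int nested_original(int x)
{
    if (x < 100) {
        if (x > 0)
            return 1;   /* S1: a and b   */
        else
            return 2;   /* S2: a and !b  */
    } else {
        return 3;       /* S3: !a        */
    }
}

/* Exchanged nesting: the outer test is b (x > 0). Because !b implies a,
   the else-branch of the outer test can go straight to S2. */
int nested_exchanged(int x)
{
    if (x > 0) {
        if (x < 100)
            return 1;   /* S1: b and a          */
        else
            return 3;   /* S3: b and !a         */
    } else {
        return 2;       /* S2: !b, therefore a  */
    }
}
```

Both functions select the same statement for every x; which nesting executes
fewer tests depends on how often each outer condition is taken.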



3.4 Experimental Results

We have experimented with eight multimedia programs from Mediabench [12].
Tables 4 and 5 show the performance and energy consumption, respectively,
before and after IF-merging. The machine codes (both before and after our
transformations) are generated by GCC (pocket Linux version) with the most
aggressive optimizations (-O3). Due to the space limit, detailed explanations are
omitted.

Table 4. Performance improvement by IF-Merging

Programs         Original (s)   Optimized (s)   Speedup
ADPCM_coder            0.0670          0.0607     1.104
ADPCM_decoder          0.0639          0.0594     1.076
G721_encode              2.01            1.88     1.069
G721_decode              3.69            3.46     1.066
GSM_toast                1.11            1.04     1.067
GSM_untoast              0.51            0.47     1.085
PEGWIT_encrypt          0.424           0.412     1.029
PEGWIT_decrypt          0.240           0.236     1.017
Harmonic Mean                                     1.063

Table 5. Energy Saving by IF-Merging

Programs         Original (J)   Optimized (J)   Saving
ADPCM_coder            0.0324          0.0294     9.3%
ADPCM_decoder          0.0325          0.0299     8.0%
G721_encode            0.9403          0.8855     5.8%
G721_decode            1.7375          1.6311     6.1%
GSM_toast              0.5374          0.5049     6.0%
GSM_untoast            0.2426          0.2228     8.2%
PEGWIT_encrypt         0.2226          0.2171     2.5%
PEGWIT_decrypt         0.1260          0.1241     1.5%

3.5 Related Work

To reduce branch cost, many branch reduction techniques have been proposed,
including branch reordering [17], conditional branch elimination [15, 23],
branch alignment [22], and predicated execution [20, 21]. As we finished writing
this paper, we discovered that part of our work in Section 3.3 is similar
to a recent independent effort by Kreahling et al. [16]. They present a
profile-based condition merging technique to replace the execution of multiple
branches, which have different conditions, with a single branch. Their
technique, however, does not consider branches separated by intermediate
statements. Neither do they consider nested IF statements, which we consider in
Section 3.3. We have also given an analysis of the trade-off, which is missing
in [16]. Moreover, they restrict the conditions in the candidate IF statements
to be comparisons between variables and constants. We do not have such
restrictions.

Calder and Grunwald propose an improved branch alignment based on the
architectural cost model and the branch prediction architecture. Their branch
alignment algorithm can improve a broad range of static and dynamic branch
prediction architectures. In [23], Mueller and Whalley describe an optimization
to avoid conditional branches by replicating code. They perform a program
analysis to determine the conditional branches in a loop which can be avoided by
code replication. They do not merge branches separated by intermediate
statements. In [17], Yang et al. describe reordering the sequences of
conditional branches using profiling data. By branch reordering, the number of
branches executed at run-time is reduced. These techniques seem orthogonal to
our IF-merging.

4 Conclusion

In this extended abstract, we use computation reuse and IF-merging as two 
examples of expanding the scope of redundancy removal. We show that both 
program execution time and energy consumption can be reduced quite substan- 
tially via such operation reuse techniques. It is clear that profile information is 
important in both examples. We believe that a general model for redundancy 
detection can be highly useful for uncovering more opportunities of redundancy 
removal. As our next step, our research group is investigating alternative models 
for this purpose.

Abstract. Application energy consumption has become an increasingly
important issue for both high-end microprocessors and mobile and em- 
bedded devices. A multitude of circuit and architecture-level techniques 
have been developed to improve application energy efficiency. However, 
relatively less work studies the effects of compiler transformations in 
terms of application energy efficiency. In this paper, we use energy- 
estimation tools to profile the execution of benchmark applications. The 
results show that energy consumption due to memory instructions ac- 
counts for a large share of total energy. An effective compiler technique 
that can improve energy efficiency is memory redundancy elimination. 
It reduces both application execution cycles and the number of cache 
accesses. We evaluate the energy improvement over 12 benchmark appli- 
cations from SPEC2000 and MediaBench. The results show that memory 
redundancy elimination can significantly reduce energy in the processor 
clocking network and the instruction and data caches. The overall ap- 
plication energy consumption can be reduced by up to 15%, and the 
reduction in terms of energy-delay product is up to 24%. 



1 Introduction 

Application energy consumption has become an increasingly important issue for 
the whole array of microprocessors spanning from high-end processors used in 
data centers to those inside mobile and embedded devices. Energy conserva- 
tion is currently the target of intense research efforts. A multitude of circuit 
and architecture-level techniques have been proposed and developed to reduce 
processor energy consumption [1, 2, 3]. 

However, many of these research efforts focus on hardware techniques, such 
as dynamic voltage scaling (DVS) [1, 4, 5] and low-energy cache design [2, 6, 3]. 
Equally important, application-level techniques are necessary to make program 
execution more energy efficient, as ultimately, it is the applications executed by 
the processors that determine the total energy consumption. 

In this paper, we look at how compiler techniques can be used to improve 
application energy efficiency. In Section 2, we use energy profiling to identify 
top energy consuming micro-architecture components and motivate the study of 
memory redundancy elimination as a potential technique to reduce energy. Sec- 
tion 3 overviews a new algorithm for memory redundancy detection and presents 



L. Rauchwerger (Ed.): LCPC 2003, LNCS 2958, pp. 288-305, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004



Memory Redundancy Elimination 289 



Table 1. SimpleScalar Simulator Configuration

Parameter                Value

Processor Core
RUU Size                 64 instructions
LSQ Size                 32 instructions
Fetch Queue Size         8 instructions
Fetch Width              4 instructions/cycle
Decode Width             4 instructions/cycle
Issue Width              4 instructions/cycle
Commit Width             4 instructions/cycle
Function Units           4 integer ALUs, 1 integer multiply, 1 integer divide,
                         1 FP add, 1 FP multiply, 1 FP divide/sqrt

Branch Prediction
Branch Predictor         Combined: 4K chooser; Bimodal: 4K table;
                         2-Level: 1K table, 10-bit history
BTB                      1024-entry, 2-way
Return Address Stack     32-entry
Misprediction Penalty    7 cycles

Memory Hierarchy
L1 Data Cache            64KB, 2-way (LRU), 32B block, 1 cycle latency
L1 Instruction Cache     64KB, 2-way (LRU), 32B block, 1 cycle latency
L2 Cache                 Unified, 2MB, 4-way (LRU), 32B block, 12-cycle latency
Memory                   100-cycle latency
TLB                      128 entry, fully associative, 30-cycle miss latency


two frameworks to remove redundant memory instructions. Section 4 presents 
the experimental results. Section 5 summarizes related work and Section 6 con- 
cludes the paper. 

2 Energy Profiling 

Optimizing compilers have been successful in improving program performance [7].
One key reason is that accurate performance models are used to evaluate vari- 
ous code transformations. Similarly, an automatic program energy efficiency op- 
timization requires accurate energy dissipation modeling. Unfortunately, most 
energy estimation tools work on the circuit and transistor level and require de- 
tailed information from circuit design. Recently, researchers have started to build 
higher level energy modeling tools, such as Wattch [8] and SimplePower [9] , which 
can estimate power and energy dissipation of various micro-architecture compo- 
nents. When combined with instruction-level performance simulators, these tools 
provide an ideal infrastructure to evaluate compiler optimizations targeting en- 
ergy efficiency. 

In our work, we use the Wattch [8] tool along with the SimpleScalar [10]
simulators to study compiler techniques in terms of application energy
consumption. Our approach is as follows: we first profile application
execution and measure the breakdown of energy consumption by major processor
component. This step reveals how energy is dissipated. Based on the
application energy profile, we can then identify promising code optimizations
to improve energy efficiency.

In CMOS circuits, dynamic power consumption accounts for the major share
of power dissipation. We use Wattch to get the dynamic power breakdown of
superscalar processors. The processor configuration, shown in Table 1, is
similar to that of the Alpha 21264. The Wattch tool is configured to use
the parameters of a 0.35um process at 600 MHz with a supply voltage of 2.5V.
The pie chart in Figure 1 shows the percentage of dynamic power dissipation
for the micro-architecture components, assuming each component is fully
active.

[Figure 1: pie chart titled "Dynamic Power of Components"]

Fig. 1. Active dynamic power consumption by micro-architecture components

Figure 1 shows that the top dynamic power-dissipating components are the 
global clocking network and on-chip caches. Combined together, they account 
for more than 70% of the total dynamic power dissipation. This suggests that 
the clocking network and caches should be the primary targets for compiler 
techniques to improve energy efficiency. 

The processor energy of dynamic switching ($E_d$) can be defined as:

$$E_d = \alpha C V_{dd}^2$$

In the above equation, $C$ is the load capacitance, $V_{dd}$ is the supply
voltage, and $\alpha$ is the switching activity factor, which indicates how
often logic transitions from low to high take place [11]. $C$ and $V_{dd}$
depend on the particular process technology and circuit design, while the
activity factor $\alpha$ depends on the code being executed [11]. The main
leverage for the compiler to minimize $E_d$ is to reduce $\alpha$.

We ran 12 benchmark applications and profiled their energy consumption. The
benchmarks are chosen from the widely used SPEC2000 and MediaBench [12]
suites. Table 2 describes the benchmark applications. We compiled the
benchmark applications using the GNU GCC compiler with -O4 level optimization.
The compiled executables are then run on the out-of-order superscalar
simulator with Wattch to collect run-time and energy statistics. Table 3 shows
the total energy consumption and the energy consumed in the clocking network,
the top-level I-Cache and the top-level D-Cache.

The results show that the energy distribution of the micro-architecture
components is very similar to the power distribution graph in Figure 1. The
major difference is that all applications exhibit good cache locality, so the
L2 cache is rarely accessed because there are few top-level cache misses.
Therefore, compared to the energy in other components, the energy in the L2
cache is negligible. As shown in Table 3, energy consumption in the clocking
network and top-level caches accounts for a large share of total application
energy. For the 12 applications, the clocking network and L1 caches account
for more than 58% (geometric mean) of total energy.

Table 2. Test Benchmarks

Benchmark    Description                            Input
adpcm        16-to-4 bit voice encoding             clinton.pcm
g721         CCITT G.721 voice encoding             clinton.pcm
gsm          GSM speech encoding                    clinton.pcm
epic         Pyramid image encoding                 test_img.pgm
pegwit       Elliptic curve public key encryption   news.txt
mpeg2dec     MPEG-2 video decoding                  child.mpg
181.mcf      Combinatorial optimization             test input
164.gzip     Compression                            test input
256.bzip2    Compression                            test input
175.vpr      FPGA placement and routing             test input
197.parser   Link grammar parser of English         test input
300.twolf    Circuit placement and routing          test input

Table 3. Application energy consumption and energy consumption by clocking
network, top level I-Cache and D-Cache. Energy unit is mJ (10^-3 Joule)

Benchmark       Total (mJ)  Clock (mJ)  Ratio  I-Cache (mJ)  Ratio  D-Cache (mJ)  Ratio
adpcm               257.33       76.90  29.9%         46.45  18.1%         14.09   5.5%
g721              9,079.94    2,739.31  30.2%      1,402.15  15.4%      1,034.61  11.4%
gsm               6,394.06    1,917.07  30.0%        975.21  15.3%      1,072.65  16.8%
epic              1,860.85      590.95  31.8%        244.62  13.2%        158.69   8.5%
pegwit            1,236.13      350.43  28.4%        188.82  15.3%        197.86  16.0%
mpeg              5,227.71    1,570.38  30.0%        683.34  13.1%        614.02  11.8%
181.mcf           8,743.05    2,823.21  32.3%      1,175.15  13.4%      1,393.48  15.9%
164.gzip         27,590.61    8,199.88  29.7%      3,806.98  13.8%      4,491.96  16.3%
256.bzip2        60,783.20   19,197.48  31.6%      8,451.11  13.9%      9,419.49  15.5%
175.vpr          32,053.54    9,625.57  30.0%      4,261.32  13.3%      5,267.46  16.4%
197.parser       10,090.21    3,101.89  30.7%      1,466.54  14.5%      1,636.30  16.2%
300.twolf        10,620.75    3,208.99  30.2%      1,555.77  14.7%      1,694.24  16.0%
Geometric Mean                          30.4%                14.4%                13.2%

Table 4 shows the dynamic instruction count and the dynamic load and store
counts. The results show that memory instructions account for about 24%
(geometric mean) of dynamic instructions, and for the more sophisticated
SPEC2000 applications the percentage is even higher, with a geometric mean of
36%. The large number of dynamic memory instructions has the following
consequences: first, these instructions must be fetched from the I-Cache
before execution, thus costing energy in the I-Cache; second, instruction
execution also costs energy, including that in the clocking network; and
third, the execution of memory instructions also requires D-Cache access, and
this is the major cause of D-Cache energy consumption. As both the clocking
network and the caches are top power-dissipating components, memory
instructions have a significant impact on total energy consumption.

Table 4. Memory instruction count and ratio. Benchmarks compiled with GCC -O4

Benchmark               Total         Load  Ratio        Store  Ratio  Load+Store
adpcm               9,136,002      460,201   5.0%      117,188   1.3%        6.3%
g721              332,838,710   43,242,018  13.0%   11,531,406   3.5%       16.5%
gsm               243,270,088   41,358,696  17.0%   11,716,196   4.8%       21.8%
epic               59,933,404    7,505,703  12.5%    1,007,723   1.7%       14.2%
pegwit             44,934,340    9,241,265  20.6%    2,767,074   6.2%       26.7%
mpeg              174,648,416   26,999,461  15.5%    6,320,129   3.6%       19.1%
181.mcf           263,801,691   67,462,222  25.6%   38,954,777  14.8%       40.3%
164.gzip          872,033,479  204,006,573  23.4%   87,834,271  10.1%       33.5%
256.bzip2       1,953,921,052  425,238,535  21.8%  147,407,675   7.5%       29.3%
175.vpr         1,044,347,692  314,776,025  30.1%  105,206,772  10.1%       40.2%
197.parser        322,027,162   88,673,262  27.5%   26,627,863   8.3%       35.8%
300.twolf         337,032,127   96,499,758  28.6%   32,200,073   9.6%       38.2%
Geometric Mean                              18.3%                5.5%       24.0%

The above energy profiling data indicate that memory instructions are a good
target for improving application energy efficiency. Redundant memory
instructions represent wasted energy; removing them [14,40] should reduce
energy costs. The dominant modern processor architecture is the load-store
architecture, in which most instructions operate on data in the register file,
and only loads and stores can access memory. Between the processor core and
main memory, the I-Cache stores the instructions to be fetched and executed;
the D-Cache serves as a local copy of memory data, so loads and stores can
access data faster. When redundant memory instructions are removed, the
traffic from memory through the I-Cache to the CPU core is reduced because
fewer instructions are fetched. This saves energy in the I-Cache. Data
accesses in the D-Cache are also reduced, saving energy in the D-Cache.
Finally, removing memory instructions speeds up the application and saves
energy in the clocking network.

As the analysis above shows, the clocking network and cache structures are
among the top energy-consuming components in the processor. Thus, energy
savings in these components can significantly reduce total energy consumption.
The rest of this paper presents compile-time memory redundancy elimination and
evaluates its effectiveness in improving energy efficiency.

3 Memory Redundancy Elimination 

Memory redundancy elimination is a compile-time technique to remove
unnecessary memory instructions. Consider the sample C code in Figure 2; in
the functions full_red, par_cond and par_loop, the struct field accesses p->x
and p->y are generally compiled into loads. However, the loads in lines 11 and
12 are fully redundant with those in line 10, as they always load the same
values at run time; similarly, the loads in line 19 are partially redundant
with those in line 17 when the body of the conditional statement is executed;
the loads in line 25 are partially redundant, as the values need to be loaded
only in the first loop iteration and all the remaining iterations load the
same values. These redundant loads can be detected and removed at compile
time. As we discussed in Section 2, memory instructions incur significant
dynamic energy consumption, so memory redundancy elimination can be an
effective energy-saving transformation.

 1  struct parm {
 2      int x;
 3      int y;
 4  };
 5  struct parm pa = {3, 7};
 6  struct parm pb = {2001, 2002};
 7
 8  void full_red(struct parm *p, int *result)
 9  {
10      result[0] = p->x + p->y;
11      result[1] = p->x - p->y;
12      result[2] = p->x + p->y;
13  }
14  void par_cond(struct parm *p, int *result)
15  {
16      if (p->x > 10) {
17          result[0] = p->x + p->y;
18      }
19      result[1] = p->x - p->y;
20  }
21  void par_loop(struct parm *p, int *result)
22  {
23      int i;
24      for (i=0; i<100; i++) {
25          result[i] = p->x + p->y;
26      }
27  }
28  void client()
29  {
30      int r[6][100];
31      full_red(&pa, r[0]);
32      full_red(&pb, r[1]);
33      par_cond(&pa, r[2]);
34      par_cond(&pb, r[3]);
35      par_loop(&pa, r[4]);
36      par_cond(&pb, r[5]);
37  }

Fig. 2. Memory Redundancy Example. The loads in lines 11 and 12 for p->x and
p->y are fully redundant; the loads in line 19 for p->x and p->y are partially
redundant due to the conditional statement in line 16; the loads for p->x and
p->y in line 25 are partially redundant as they are loop invariant
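As a source-level illustration of what removing these loads amounts to, the sketch below rewrites the three functions of Figure 2 by hand so that each field is loaded into a local variable and reused. It is only an approximation of what a compiler would do on its IR, and it assumes that result never overlaps *p, which is the property the compiler has to establish before it may reuse a loaded value. The _opt function names are ours and do not appear in the original.

```c
struct parm {
    int x;
    int y;
};

/* full_red with the fully redundant loads removed: p->x and p->y are
   each loaded once and the values are reused from locals.
   Assumption: result[] does not overlap *p. */
void full_red_opt(struct parm *p, int *result)
{
    int x = p->x;
    int y = p->y;
    result[0] = x + y;
    result[1] = x - y;
    result[2] = x + y;
}

/* par_cond with the redundant loads removed: p->x is loaded once for the
   test, and a copy of the p->y load is placed on the path that skips the
   conditional body, so exactly one load of p->y executes on either path. */
void par_cond_opt(struct parm *p, int *result)
{
    int x = p->x;
    int y;
    if (x > 10) {
        y = p->y;
        result[0] = x + y;
    } else {
        y = p->y;
    }
    result[1] = x - y;
}

/* par_loop with the loop-invariant loads moved in front of the loop. */
void par_loop_opt(struct parm *p, int *result)
{
    int x = p->x;
    int y = p->y;
    int i;
    for (i = 0; i < 100; i++) {
        result[i] = x + y;
    }
}
```

Each rewritten function stores the same values as its counterpart in Figure 2 as long as the no-overlap assumption holds; if result could point into the struct, the reloads in the original would be required.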

In our prior work [13], we presented a new static analysis algorithm to detect
memory redundancy. This algorithm uses value numbering on memory operations
and is the basis for the memory redundancy removal techniques described in
this paper. This paper extends our work in [13] by providing a more powerful
removal framework that is capable of eliminating a larger set of memory
redundancies; furthermore, this paper focuses on energy efficiency benefits,
while the previous work was concerned with performance improvements. In
Section 3.1, we first give an overview of this memory redundancy detection
algorithm; in Section 3.2, we present code transformations that use the
analysis results of the detection algorithm to remove fully and partially
redundant memory instructions.

3.1 Finding Memory Redundancy

In [13], we presented a new static analysis algorithm to detect memory
redundancy. We extended Simpson's optimistic global value-numbering algorithm,
SCCVN [14, 15], to value number memory instructions. SCCVN is a powerful
procedure-scope algorithm for detecting scalar (i.e., non-memory) redundancy.
It discovers value-based identical scalar instructions (as opposed to lexical
identities), performs optimistic constant propagation, and handles a broad set
of algebraic identities.

To extend SCCVN so that it can also detect identities for loads and stores, we
annotated the memory instructions in the compiler's intermediate
representation (IR) with M-lists: lists of the names of memory objects that
are potentially defined by the instruction (an M-DEF list) and the names of
those that are potentially used by the instruction (an M-USE list). The
M-lists are computed by a flow-insensitive, context-insensitive,
Andersen-style pointer analysis [16]. Our compiler uses a low-level,
RISC-style, three-address IR called ILOC. All memory accesses in ILOC occur on
load and store instructions. The other instructions work from an unlimited set
of virtual registers. The ILOC load and store code with M-lists for line 10 in
Figure 2 is shown below:

    iLD r1 => r4    M-use [@pa_x @pb_x]
    iLD r2 => r5    M-use [@pa_y @pb_y]
    iST r3 r7       M-use [@r] M-def [@r]

As an example, the M-USE list in iLD r1 => r4 M-use [@pa_x @pb_x] means that
the integer load reads from the address in r1, puts the result in r4, and may
access memory object pa.x or pb.x. This corresponds to p->x in the source
code. In the annotated IR, loads have only an M-USE list, because loads do not
change the states of the memory objects they reference; during value
numbering, the value numbers of the names in M-USE indicate both the before
and after states of the memory objects affected by the load. Stores are
annotated with both M-USE and M-DEF lists, because stores may write new values
to memory objects; during value numbering of a store, the value numbers of the
names in M-USE indicate the states before the store executes, and the value
numbers in M-DEF indicate the states after it executes.

Using M-lists, we can value number memory instructions along with scalar
instructions and detect instruction identities. To value number a memory
instruction, both its normal operands (base address, offset, and result) and
its M-list names are used as a combined hash key to look up values in the hash
table. If there is a match, the memory instructions access the same address
with the same value and change the affected memory objects into identical
states; therefore, the matching instructions are redundant. For example, after
value numbering, the three loads that correspond to p->x in function full_red
in Figure 2 all have the form iLD r1_vn => r4_vn M-use [@pa_x_vn @pb_x_vn];
therefore the three loads are identical, and the last two are redundant and
can be removed, reusing the value in register r4_vn. Also in Figure 2, memory
value numbering detects that the loads of p->x and p->y in the functions
par_cond and par_loop are redundant.

$$AVLOC_i = \text{computed locally as in Section 3.2}$$

$$AVIN_i = \begin{cases} \emptyset & \text{if } i \text{ is the entry block} \\ \bigcap_{h \in pred(i)} AVOUT_h & \text{otherwise} \end{cases}$$

$$AVOUT_i = AVIN_i \cup AVLOC_i$$

Fig. 3. CSE Data Flow Equation System
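The lookup step of memory value numbering described in Section 3.1 can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the SCCVN-based algorithm of [13]: it handles loads only, it replaces the hash table with a linear scan for brevity, and every name in it (load_key, vn_table, vn_load, vn_fresh_state) is hypothetical.

```c
#define MAX_MUSE   4
#define TABLE_SIZE 64

/* Lookup key for a load: the value number of its address plus the value
   numbers (memory states) of the objects on its M-USE list. */
struct load_key {
    int addr_vn;
    int n_muse;
    int muse_vn[MAX_MUSE];
};

struct vn_table {
    struct load_key keys[TABLE_SIZE];
    int result_vn[TABLE_SIZE];
    int count;      /* number of recorded loads */
    int next_vn;    /* next unused value number */
};

static int key_equal(const struct load_key *a, const struct load_key *b)
{
    int i;
    if (a->addr_vn != b->addr_vn || a->n_muse != b->n_muse)
        return 0;
    for (i = 0; i < a->n_muse; i++)
        if (a->muse_vn[i] != b->muse_vn[i])
            return 0;
    return 1;
}

void vn_init(struct vn_table *t, int first_vn)
{
    t->count = 0;
    t->next_vn = first_vn;
}

/* Returns the value number of the load's result. *redundant is set to 1
   when a load with an identical key was already recorded, meaning the
   earlier result can be reused. */
int vn_load(struct vn_table *t, const struct load_key *k, int *redundant)
{
    int i;
    for (i = 0; i < t->count; i++) {
        if (key_equal(&t->keys[i], k)) {
            *redundant = 1;
            return t->result_vn[i];
        }
    }
    *redundant = 0;
    if (t->count < TABLE_SIZE) {          /* record the new load */
        t->keys[t->count] = *k;
        t->result_vn[t->count] = t->next_vn;
        t->count++;
    }
    return t->next_vn++;
}

/* A store gives each object on its M-DEF list a new state, modeled here
   as a fresh value number. */
int vn_fresh_state(struct vn_table *t)
{
    return t->next_vn++;
}
```

Two loads with the same address value number and the same M-USE states map to one result; after a store assigns a fresh state to an object on the M-USE list, the key changes and the next load is treated as a new value.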



3.2 Removing Memory Redundancy 

After memory redundancies are detected, code transformations are used to elim- 
inate the redundant instructions. We have used two different techniques to 
perform the elimination phase: traditional common subexpression elimination 
(CSE) [17] and partial redundancy elimination (PRE) [18, 19, 20]. Using mem- 
ory value numbering results, we can easily extend scalar CSE and PRE and 
build unified frameworks that remove both scalar and memory-based redundan- 
cies. Memory CSE was first described in [13] and is briefly recapitulated in this 
paper for completeness; memory PRE is a more powerful removal framework 
and can eliminate a larger set of memory redundancies. This section shows the 
two frameworks, extended to include memory redundancy removal. 



Available Expressions. Traditional common subexpression elimination (CSE)
finds and removes redundant scalar expressions (sometimes called fully
redundant expressions). It computes the set of expressions that are available
on entry to each block as a data-flow problem. An expression e is available on
entry to block b if every control-flow path that reaches b contains a
computation of e. Any expression in the block that is also available on entry
to the block (in AVIN) is redundant and can be removed.

Figure 3 shows the equations used for value-based CSE. To identify fully
redundant memory instructions, we assign each set of equivalent memory
instructions a unique ID number. The AVLOC_i set for block i is computed by
adding the scalar values and memory IDs defined in i. When the equations in
Figure 3 are solved, the AVIN_i set contains the scalar values and memory IDs
that are available at the entry of block i. Fully redundant instructions
(including redundant memory instructions) can be detected and removed by
scanning the instructions in i in execution order as follows: if a scalar
instruction s computes a value v ∈ AVIN_i, s is redundant and is removed; if a
memory instruction m has an ID in AVIN_i, m is redundant and is removed. For
the example in Figure 2, the new memory CSE removes the 4 redundant loads on
lines 11 and 12, as they are assigned the same IDs as those in line 10.
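The equations of Figure 3 can be solved with a simple iterative fixed-point loop. The sketch below is one possible way to do so, assuming that each scalar value or memory ID is represented by one bit of an unsigned word, that block 0 is the entry block, and that every other block has at least one predecessor. The struct and function names are illustrative and are not taken from our compiler.

```c
#define MAX_PREDS 4

struct block {
    int n_preds;
    int preds[MAX_PREDS];   /* indices of predecessor blocks */
    unsigned avloc;         /* values and memory IDs computed in the block */
    unsigned avin;          /* result: available on entry */
    unsigned avout;         /* result: available on exit */
};

/* Solves AVIN_i = intersection of AVOUT_h over predecessors h (empty for
   the entry block) and AVOUT_i = AVIN_i | AVLOC_i by iterating until
   nothing changes. Block 0 is assumed to be the entry block. */
void solve_avail(struct block *b, int n)
{
    int i, k, changed = 1;

    /* Optimistic start: everything is available except at the entry. */
    for (i = 0; i < n; i++) {
        b[i].avin = (i == 0) ? 0u : ~0u;
        b[i].avout = b[i].avin | b[i].avloc;
    }
    while (changed) {
        changed = 0;
        for (i = 1; i < n; i++) {
            unsigned in = ~0u;
            unsigned out;
            for (k = 0; k < b[i].n_preds; k++)
                in &= b[b[i].preds[k]].avout;
            out = in | b[i].avloc;
            if (in != b[i].avin || out != b[i].avout) {
                b[i].avin = in;
                b[i].avout = out;
                changed = 1;
            }
        }
    }
}
```

For a shape like par_cond in Figure 2, with bit 0 standing for the load of p->x and bit 1 for the load of p->y, the join block ends up with only bit 0 in its AVIN set: the load of p->x is fully redundant there, while the load of p->y is available on one path only and is left for PRE.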



Partial Redundancy Elimination. The key idea behind partial redundancy
elimination (PRE) and lazy code motion is to find computations that are
redundant on some, but not all, paths [18, 19, 20]. Given an expression e at
point p that is redundant on some subset of the paths that reach p, the
transformation inserts evaluations of e on the paths where it had not been
evaluated, making the evaluation at p redundant on all paths. Our
transformation is based on the formulation due to Drechsler and Stadel [20].

[Figure 4: PRE data-flow equations, including a Placement panel]

Fig. 4. PRE Data Flow Equation System

Drechsler and Stadel's formulation computes the sets INSERT and DELETE for
scalar expressions in each block. The set INSERT_{i->j} contains those
partially redundant expressions that must be duplicated along the edge i -> j.
The DELETE_i set contains the expressions in block i that are redundant and
can be removed. The data-flow equations are shown in Figure 4.

PRE is, essentially, a code motion transformation. Thus, it must preserve
data dependences during the transformation. (The flow, anti, and output
dependences of the original program must be preserved [7].) The results from
our new memory redundancy detection algorithm let us model dependence
relations involving memory instructions and remove redundant loads (see
footnote 1).

To encode the constraints of load motion into the equations for PRE, we must
consider both the load address and the states of the memory objects in the
M-USE list for the load. Specifically, a load cannot be moved past the
instructions that compute its address; in addition, other memory instructions
might change the states of the memory objects that the load may read from, so
a load cannot be moved past any memory instruction that assigns a new value
number (i.e., defines a new state) to a memory object in the M-USE list of the
load. Using the memory value numbering results, we can build a value
dependence graph that encodes the dependence relationship among the value
numbers of the results of scalar instructions and the value numbers of the
M-DEF and M-USE lists for memory instructions. In particular, 1) for each
scalar instruction, the instruction becomes the DEF node that defines the
value number of its result, and we add a dependence edge from the DEF node of
each source operand to the scalar instruction node; 2) for a store, the store
becomes the DEF node that defines the value numbers of any objects on its
M-DEF list that are assigned new value numbers; 3) for a load, the instruction
becomes the DEF node for the load result, and we add edges from the DEF nodes
for the load address and for the value numbers of any memory objects on the
load's M-USE list.

Intuitively, the value numbers of scalar operands and M-list objects cap- 
ture the DEF-USE relations among scalar and memory instructions. Stores can be 
thought of as DEF-points for values stored in the memory objects on the M-DEF; 
the value dependence edges between stores and loads which share common mem- 
ory objects represent the flow dependences between store and load instructions. 
Thus, using the value numbers assigned by the memory redundancy detection 
algorithm, we can build the value dependence graph so that it represents the 
dependence relations for both scalar and memory instructions. 

Once the value dependence graph has been built, the compiler can build the
local set ALTERED_i for each block. The ALTERED_i set contains the
instructions whose source operands would change values due to the execution of
block i. If e ∈ ALTERED_i, then code motion should not move e backward beyond
i, because doing so would violate the dependence rule. We set ALTERED_i to
include all instructions in block i other than scalar and load instructions.
This prevents the algorithm from moving those instructions. Furthermore, any
instructions that depend transitively on these instructions are also included
in ALTERED_i. This can be computed by taking the transitive closure in the
value dependence graph with respect to the DEF nodes for the instructions in
i.
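The transitive closure step can be sketched as a reachability computation over the value dependence graph. The code below is an illustration only: it assumes the graph is stored as an adjacency matrix in which edge[d][u] is non-zero when instruction u depends on a value defined by instruction d, and it takes the initial members of the set (for example, the stores in block i) as a seed array. The type and function names are hypothetical.

```c
#define MAX_NODES 32

struct dep_graph {
    int n;                                     /* number of instruction nodes */
    unsigned char edge[MAX_NODES][MAX_NODES];  /* edge[d][u]: u depends on d */
};

/* Marks in altered[] every seed node and every node that depends,
   directly or transitively, on a seed node. */
void close_altered(const struct dep_graph *g, const unsigned char *seed,
                   unsigned char *altered)
{
    int stack[MAX_NODES];   /* each node is pushed at most once */
    int top = 0;
    int v, u;

    for (v = 0; v < g->n; v++) {
        altered[v] = seed[v] ? 1 : 0;
        if (altered[v])
            stack[top++] = v;
    }
    while (top > 0) {
        int d = stack[--top];
        for (u = 0; u < g->n; u++) {
            if (g->edge[d][u] && !altered[u]) {
                altered[u] = 1;
                stack[top++] = u;
            }
        }
    }
}
```

With a store as the only seed, a load that reads an object the store defines and a scalar instruction that uses the load result are both marked, while an unrelated load stays unmarked and remains free to move.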

Another local set, ANTLOC_i, contains the candidate instructions for PRE to
remove. In traditional applications of PRE, ANTLOC_i contains only scalar
instructions. Using the value numbers for M-lists, we can model memory
dependences and put loads into ANTLOC_i. We set ANTLOC_i to contain both the
scalar and load instructions in block i that are not in ALTERED_i; in other
words, it contains the scalars and loads whose movement is not restricted by
i. The last local set in the PRE framework is AVLOC_i. It contains all the
scalars and loads in i.

1 We exclude stores from PRE for two reasons. First, loads do not create anti
and output dependences. Fixing the positions of stores greatly simplifies
dependence graph construction. Second, and equally important, our experiments
show that opportunities to remove redundant stores are quite limited [21].

[Figure 5: Front End (c2i) -> Analysis/Transformation Passes on ILOC -> Back End (i2ss)]

Fig. 5. ILOC Execution Model

By treating memory instructions in this way, we force the data-flow system to 
consider them. When the data flow system is solved, the INSERT and DELETE 
sets contain the scalar instructions and loads that are partially redundant and 
can be removed. In the example in Figure 2, the partially redundant loads for 
p->x and p->y in line 19 are in the DELETE set, and copies of these loads are 
in the INSERT set of the block where the test conditional is false. Similarly, 
the loads in the loop body in line 25 are also removed and copies of these loads 
are inserted in the loop header. In summary, the memory PRE can successfully 
remove those partial memory redundancies in Figure 2. 

4 Experimental Results 

Figure 5 shows the execution model for our compiler. The C front end (c2i)
converts the program into ILOC. The compiler applies multiple analysis and
optimization passes to the ILOC code. Finally, the back end (i2ss) generates
SimpleScalar executables.

To evaluate the energy efficiency improvement of memory redundancy
elimination, we implemented memory CSE and memory PRE as ILOC passes, referred
to as M-CSE and M-PRE. Because the memory versions of CSE and PRE subsume
scalar CSE and PRE, we also implemented the scalar versions, referred to as
S-CSE and S-PRE, to isolate the effects of memory redundancy removal. We use
the same benchmarks as in Table 2. The benchmarks are first translated into
ILOC; then multiple passes of traditional compiler optimizations are run on
the ILOC code, including constant propagation, dead code elimination, copy
coalescing and control-flow simplification. We then run the whole-program
pointer analysis to annotate the ILOC code with M-lists. We separately run the
S-CSE, M-CSE, S-PRE and M-PRE passes on the ILOC code, followed by the
SimpleScalar back end i2ss to create the SimpleScalar executables. We then run
the generated executables on the out-of-order superscalar simulator with the
Wattch tool and collect the run-time performance and energy statistics.

Table 5. Dynamic load count

Benchmark                           S-CSE        M-CSE   Ratio        S-PRE        M-PRE   Ratio
adpcm                             445,117      445,117  100.0%      464,893      464,893  100.0%
g721                           44,066,954   43,425,899   98.6%   44,297,744   43,618,024   98.5%
gsm                            34,723,686   33,797,428   97.3%   34,847,422   21,949,559   63.0%
epic                            8,242,764    7,708,474   93.5%    8,232,931    7,240,634   88.0%
pegwit                         11,683,127    8,437,712   72.2%   11,680,913    8,554,340   73.2%
mpeg                           27,046,064   25,134,366   92.9%   26,925,145   23,944,703   88.9%
Geometric Mean (MediaBench)                              91.9%                             84.2%
181.mcf                        70,190,795   65,194,656   92.9%   70,711,865   61,807,206   87.4%
164.gzip                      765,126,630  534,164,375   69.8%  768,743,772  526,464,600   68.5%
256.bzip2                     556,298,644  373,466,249   67.1%  554,923,128  344,316,419   62.1%
175.vpr                       396,021,979  275,488,845   69.6%  397,356,599  263,224,908   66.2%
197.parser                     97,672,948   82,377,643   84.3%   98,506,932   82,497,510   83.8%
300.twolf                      99,816,284   72,935,505   73.1%   98,696,484   72,542,711   73.5%
Geometric Mean (SPEC2000)                                75.6%                             73.0%
Overall G-Mean                                           83.4%                             78.4%

[Figure 6: chart titled "Execution Cycles"; series: M-CSE/S-CSE, S-PRE/S-CSE, M-PRE/S-CSE]

Dynamic Load Count and Cycle Count. Table 5 shows the dynamic load count for
the benchmarks. The ratio columns show the load count ratio between the memory
and scalar versions of CSE and PRE. For the majority of the benchmark
applications, both M-CSE and M-PRE significantly reduce the dynamic load
count, by a geometric mean of 16.6% for M-CSE and 21.6% for M-PRE. As M-PRE
removes memory redundancies from conditionals and loops, it removes a larger
number of memory redundancies than M-CSE. Furthermore, the data show that
M-CSE and M-PRE have more opportunities in SPEC2000 programs than in
MediaBench: the dynamic load ratios between M-PRE and S-PRE are 73.0% for
SPEC2000 and 84.2% for MediaBench, and the ratios between M-CSE and S-CSE are
75.6% for SPEC2000 and 91.9% for MediaBench. The cause of this difference is
that SPEC2000 applications are generally larger and more complex than those in
MediaBench, and more data references are compiled as memory instructions,
which provides more opportunities for memory redundancy elimination.

Figure 6 shows the impact of memory redundancy elimination on application
execution cycles. As expected, M-PRE achieves the best results, as it is the
most powerful redundancy elimination technique (see footnote 2). The reduction
in execution cycle count leads to energy savings in the clocking network.
Figure 7 shows the normalized clocking network energy consumption of M-CSE,
S-PRE and M-PRE with S-CSE as the base. The curves are mostly identical to
those in Figure 6. As with the execution cycle results, 300.twolf, 175.vpr,
256.bzip2 and gsm have the largest energy savings with M-PRE.

Cache Energy. As we discussed in Section 2, memory redundancy elimination
reduces cache accesses in both the L1 I-Cache and the L1 D-Cache, and thus
saves energy in the cache structures. Figure 8 shows the normalized L1 I-Cache
energy consumption for M-CSE, S-PRE and M-PRE with S-CSE as the base. Figure 9
shows the normalized L1 D-Cache energy for the four versions.

In Figure 8 and Figure 9, the curves for M-PRE are the lowest, as M-PRE
generally incurs the fewest I-Cache and D-Cache accesses and thus achieves the
largest energy savings. The energy consumption diagrams of the I-Cache and
D-Cache also show that memory redundancy elimination is more effective at
reducing D-Cache energy: both M-CSE and M-PRE achieve more than 10% energy
savings in the D-Cache for pegwit, 164.gzip, 256.bzip2, 175.vpr and 300.twolf,
while the energy savings in the I-Cache are relatively smaller.

2 The large execution cycle count for S-PRE in 300.twolf is due to abnormally
high L1 I-Cache misses. For other cache configurations, the S-PRE cycle count
is generally comparable to that of S-CSE.

Fig. 8. Normalized L1 I-Cache energy consumption

Fig. 9. Normalized L1 D-Cache energy consumption

Total Energy and Energy-Delay Product. Figure 10 shows the normalized total
application energy consumption. Among the redundancy elimination techniques,
M-PRE produces the best energy efficiency.

Fig. 10. Normalized total energy consumption

[Figure 11: chart titled "Energy-Delay Product"]

A useful metric to measure both application performance and energy efficiency
is the energy-delay product [11]. The smaller the energy-delay product, the
better the application energy efficiency and performance. Figure 11 shows the
normalized energy-delay product with S-CSE as the base. As memory redundancy
elimination reduces both application execution cycles and total energy
consumption, the energy-delay product for M-CSE and M-PRE is smaller. In
contrast to other techniques, such as dynamic voltage scaling, which trade
application execution speed for reduced energy consumption, memory redundancy
elimination boosts both application performance and energy efficiency, making
it a desirable compiler transformation to save energy without loss in
performance.
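The normalized energy-delay product compounds the energy and cycle-count ratios, which is why a transformation that improves both moves the metric more than either ratio alone. The short C sketch below shows the computation; the function names are illustrative, and any inputs passed to it are the caller's own measurements rather than results from this paper.

```c
/* Energy-delay product: energy multiplied by execution time (or cycles). */
double edp(double energy, double delay)
{
    return energy * delay;
}

/* Energy-delay product of an optimized build relative to a baseline
   build; values below 1.0 mean the optimized build is better. */
double normalized_edp(double e_opt, double d_opt,
                      double e_base, double d_base)
{
    return edp(e_opt, d_opt) / edp(e_base, d_base);
}
```

As a purely hypothetical example, a build that uses 15% less energy and 10% fewer cycles than the baseline has a normalized energy-delay product of 0.85 x 0.90 = 0.765.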



Application Energy Breakdown. We also studied the contribution of each
micro-architecture component to total application energy consumption. Figures
12 and 13 show the component energy breakdown for 256.bzip2 and 175.vpr, the
two applications that have the largest energy efficiency improvement. The
major energy savings for these two applications come from the clocking network
and the top-level instruction and data caches. In 256.bzip2, the clocking
network energy savings for M-CSE and M-PRE are 12% and 15% respectively, the
L1 I-Cache savings are 8% and 10%, and the L1 D-Cache savings are 23% and 24%.
The final energy savings are 12% for M-CSE and 15% for M-PRE. Similarly, in
175.vpr, the clocking network energy savings for M-CSE and M-PRE are 13% and
15% respectively, the L1 I-Cache savings are 10% and 12%, and the L1 D-Cache
savings are 25% and 26%. The final energy savings on 175.vpr are 14% for M-CSE
and 15% for M-PRE.

Fig. 12. Energy breakdown of 256.bzip2

Fig. 13. Energy breakdown of 175.vpr



5 Related Work 

Recently, power and energy issues have become critical design constraints for
both high-end processors and battery-powered embedded digital devices.
Researchers have developed many hardware-based techniques to reduce power and
energy consumption in these systems. Dynamic voltage scaling (DVS), which
dynamically varies processor clock frequency and voltage to save energy, is
described in [1, 4, 5]. The work in [2, 6, 3] discussed ways to reduce cache
energy consumption. However, all of these are circuit- and architecture-level
techniques. Relatively little attention has been paid to application-level
energy saving techniques. In [22], Kandemir et al. studied the energy effects
of loop-level compiler optimizations using array-based scientific codes. In
contrast to their work, we first profiled the total application energy
consumption to identify the top energy-consuming components and then evaluated
one compiler technique, memory redundancy elimination, which can significantly
reduce energy consumption in these components. Furthermore, our technique
targets more complicated general-purpose and multimedia applications.
Recently, researchers have been studying compile-time management of
hardware-based energy saving mechanisms, such as DVS. Hsu et al. described a
compiler algorithm to identify program regions where the CPU can be slowed
down with negligible performance loss [23]. Kremer summarized compiler-based
energy management methods in [24]. These methods are orthogonal to the
techniques in this paper.

Both scalar [17, 18, 19, 25] and memory [26, 27, 28, 29, 30] redundancy
detection and removal have been studied in the literature. The redundancy
detection algorithm used in our work is described in [13]. Compared to other
methods, this algorithm unifies the process of scalar and memory redundancy
detection and is able to find more redundancies. Most of the previous work
concerns application run-time speed, while our work targets energy savings,
though the results show that performance is also improved.

6 Conclusion 

Most of the recent work on low power and energy systems focuses on circuit 
and architecture-level techniques. However, more energy savings are possible by 
optimizing the behavior of the applications. We profiled the energy consumption
of a suite of benchmarks. The energy statistics identify the clocking network
and the first-level caches as the top energy-consuming components. With this
insight, we investigated the energy savings of a particular compiler technique,
memory redundancy elimination. We present two redundancy elimination
frameworks and evaluate the energy improvements. The results indicate that memory
redundancy elimination can reduce both execution cycles and the number of top 
level cache accesses, thus saving energy from the clocking network and the in- 
struction and data caches. For our benchmarks, memory redundancy elimination 
can achieve up to a 15% reduction in total energy consumption, and up to a 24% 
reduction in the energy-delay product. 

Abstract. Processor virtualization is a powerful technique that enables
the runtime system to carry out intelligent adaptive optimizations like 
dynamic resource management. Charm++ is an early language/system 
that supports processor virtualization. This paper describes Adaptive 
MPI or AMPI, an MPI implementation and extension, that supports 
processor virtualization. AMPI implements virtual MPI processes (VPs), 
several of which may be mapped to a single physical processor. AMPI 
includes a powerful runtime support system that takes advantage of the 
degree of freedom afforded by allowing it to assign VPs onto processors. 
With this runtime system, AMPI supports such features as automatic 
adaptive overlap of communication and computation and automatic load 
balancing. It can also support other features such as checkpointing with- 
out additional user code, and the ability to shrink and expand the set of 
processors used by a job at runtime. This paper describes AMPI, its fea- 
tures, benchmarks that illustrate performance advantages and tradeoffs 
offered by AMPI, and application experiences. 



1 Introduction 

The new generation of parallel applications is complex: these applications
simulate dynamically varying systems, use adaptive techniques such as multiple
timestepping and adaptive refinement, and often involve multiple parallel
modules. Typical implementations of MPI do not support the dynamic nature of these
applications well. As a result, programming productivity and parallel efficiency 
suffer. We present AMPI, an adaptive implementation of MPI, that is better 
suited for such applications, while still retaining the familiar programming model 
of MPI. 

The basic idea behind AMPI is to separate the issue of mapping work to pro- 
cessors from that of identifying work to be done in parallel. Standard MPI pro- 
grams divide the computation into P processes, one for each of the P processors. 
In contrast, an AMPI programmer divides the computation into a large num- 
ber V of virtual processors, independent of the number of physical processors. 
The virtual processors are programmed in MPI as before. Physical processors are 
no longer visible to the programmer, as the responsibility for assigning virtual 
processors to physical processors is taken over by the runtime system. This pro- 
vides an effective division of labor between the system and the programmer: the 
programmer decides what to do in parallel, and the runtime system decides where
and when to do it. This division allows the programmer to use the most natural 
decomposition for their problems, rather than being restricted by the physical 
machine. For example, algorithmic considerations often restrict the number of 
processors to a power of 2, or a cube, but with AMPI, V can still be a cube even 
though P is prime. 
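
As a rough illustration of this decoupling, the sketch below places V = 27 virtual processors (a cube) on P = 7 physical processors (a prime). The block placement and the helper name `block_map` are assumptions made for illustration only; in AMPI the runtime system chooses the placement and may change it later.

```python
def block_map(num_vps, num_procs):
    """Return the physical processor assigned to each virtual processor.

    VPs are split into contiguous blocks whose sizes differ by at most one.
    This is a hypothetical initial placement, not AMPI's actual policy.
    """
    return [vp * num_procs // num_vps for vp in range(num_vps)]


if __name__ == "__main__":
    V, P = 27, 7  # V is a cube (3 x 3 x 3); P is prime
    placement = block_map(V, P)
    # Number of VPs that landed on each physical processor.
    print([placement.count(p) for p in range(P)])
```

The application code only ever sees the 27 ranks; the fact that they share 7 processors is hidden behind the mapping.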

Note that the number of virtual processors V is typically much larger than P. 
Using multiple virtual processors per physical processor brings several additional 
benefits. 



1.1 Related Work 

The virtualization concept embodied by AMPI is very old, and Fox et al. [1] 
make a convincing case for virtualizing parallel programs. Unlike Fox’s work, 
AMPI virtualizes at the runtime layer rather than manually at the user level, 
and AMPI can use adaptive load balancers. Virtualization is also supported in 
DRMS [2] for data-parallel array-based applications. Charm++ is one of the
earliest, if not the first, processor-virtualization systems implemented on
parallel machines [3, 4]. AMPI builds on top of Charm++ and shares its run-time
system.

There are several excellent, complete, publicly available non-virtualized
implementations of MPI, such as MPICH [5] and MPI/LAM [6]. Many researchers
have described implementations of fault tolerance via checkpoint/restart, such
as CoCheck [7] and StarFish [8], often built on top of one of the free
implementations of MPI. AMPI differs from these efforts in that it provides full
virtualization to improve performance and allow load balancing, rather than
solely for checkpointing or fault tolerance.

Meanwhile, there have been many efforts to implement MPI nodes on top of
light-weight threads; MPI-Lite [9] and TMPI [10] are two good examples. They
have successfully used threaded execution to improve the performance of message
passing programs, especially on SMP machines. Adaptive MPI, however, enables
further optimizations through its ability to migrate the user-level threads on
which MPI processors execute.

The Charm++/AMPI approach is to let the runtime system change the
assignment of VPs to physical processors at runtime, thereby enabling a broad
set of optimizations. In the next section, we motivate the project, providing an
overview of the benefits. In Section 3 we describe how our virtual processors are
implemented and migrated. Section 4 describes the design and implementation
strategies for specific features, such as checkpointing and load balancing. We
then present performance data showing that these adaptive features are
beneficial in complex applications and affordable (i.e., they incur low
overhead) in general. Finally, we summarize our experience in using AMPI in
several large applications.

2 Benefits of Virtualization

In [11], the author has discussed in detail the benefits of processor virtualization
in parallel programming, and Charm++ has indeed taken full advantage of
these benefits. Adaptive MPI inherits most of the merits of Charm++ while
furnishing the common MPI programming environment. Here is a list of the
benefits that we will detail in this paper. 

— Adaptive Overlap of Communication and Computation: If one of the vir- 
tual processors is blocked on a receive, another virtual processor on the 
same physical processor can run. This largely eliminates the need for the 
programmer to manually specify some static computation/communication 
overlapping, as is often required in MPI. 

— Automatic Load Balancing: If some of the physical processors become over- 
loaded, the runtime system can migrate a few of their virtual processors to 
relatively underloaded physical processors. Our runtime system can make 
this kind of load balancing decision based on automatic instrumentation, as 
explained in Section 4.1. 

— Asynchronous Interfaces to Collective Operations: AMPI supports asyn- 
chronous, or non-blocking, interfaces to collective communication operations 
to allow time-consuming collective operations to overlap with other
useful computation. Section 4.2 describes this in detail.

— Automatic Checkpointing: AMPI’s virtualization allows applications to be 
checkpointed without additional user programming, as described in Sec- 
tion 4.3. 

— Better Cache Performance: A virtual processor handles a smaller set of data 
than a physical processor, so a virtual processor will have better memory 
locality. This blocking effect is the same method many serial cache opti- 
mizations employ. 

— Flexible Usage of Available Processors: The ability to migrate virtual pro- 
cessors can be used to adapt the computation if the available part of the 
physical machine changes. See Section 4.5 for details. 

3 Adaptive MPI 

3.1 AMPI Implementation 

AMPI is built on Charm++ and uses its communication facilities, load
balancing strategies, and threading model.

Charm++ uses an object-based model: programs consist of a collection of
message-driven objects mapped onto physical processors by the Charm++ run-time
system. The objects communicate with other objects by invoking an
asynchronous entry method on the remote object. Upon each such asynchronous
invocation, a message is generated and sent to the destination processor where
the remote object resides. Adaptive MPI implements its MPI processors as
Charm++ “user-level” threads bound to Charm++ communicating objects.

Fig. 1. An MPI process is implemented as a user-level thread, several of which can be
mapped to one single physical processor. This virtualization enables several powerful 
features including automatic load balancing and adaptive overlapping 



Message passing between AMPI virtual processors is implemented as
communication among these Charm++ objects, and the underlying messages are
handled by the Charm++ runtime system. Even with object migration, Charm++
supports efficient routing and forwarding of the messages.

Charm++ supports migration of objects via efficient data migration and
message forwarding if necessary. Migration presents interesting problems for
basic and collective communication, which are effectively solved by the Charm++
runtime system [12].

Migration can be used for built-in measurement-based load balancing [13],
for adapting to changing load on workstation clusters [14], and even for
shrinking/expanding jobs on timeshared machines [15].

The threads used by AMPI are user-level threads; they are created and
scheduled by user-level code rather than by the operating system kernel. The
advantages of user-level threads are fast context switching (see footnote 1),
control over scheduling, and control over stack allocation. Thus, it is feasible
to run thousands of such threads on one physical processor (see, e.g., [16]).
Charm++’s user-level threads are scheduled non-preemptively.
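
The following sketch mimics non-preemptive user-level scheduling with Python generators standing in for threads. It is only an analogy: the names `virtual_processor` and `run`, the mailbox structure, and the simulated one-message-per-step network are all assumptions, not part of AMPI or Charm++. It shows how a thread blocked on a receive gives up the processor so that another thread can compute.

```python
from collections import deque


def virtual_processor(rank, mailbox, log):
    """One VP: compute, then wait for a message before finishing."""
    log.append(f"vp{rank}: compute")
    while not mailbox[rank]:
        # Blocked on a receive: hand the CPU back to the scheduler.
        yield
    msg = mailbox[rank].popleft()
    log.append(f"vp{rank}: received {msg}")


def run(num_vps):
    """Round-robin, non-preemptive scheduler for generator-based threads."""
    mailbox = [deque() for _ in range(num_vps)]
    log = []
    ready = deque(virtual_processor(r, mailbox, log) for r in range(num_vps))
    pending = deque((r, f"data{r}") for r in range(num_vps))  # simulated network
    while ready:
        thread = ready.popleft()
        try:
            next(thread)
            ready.append(thread)  # the thread yielded, so it is still alive
        except StopIteration:
            pass  # the thread finished
        if pending:
            # One message "arrives" per scheduling step.
            dest, payload = pending.popleft()
            mailbox[dest].append(payload)
    return log


if __name__ == "__main__":
    for line in run(2):
        print(line)
```

With two VPs, the second VP's computation runs while the first is still waiting for its message, which is the overlap described in Section 2.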

3.2 Writing an AMPI Program 

Writing an AMPI program is barely different from writing an ordinary MPI 
program. In fact, a legal MPI program is also a legal AMPI program. To take 
full advantage of the migration mechanism, however, there is one more issue to 
address: global variables. 

A global variable is any variable that is stored at a fixed, preallocated
location in memory. Although not specified by the MPI standard, many actual MPI
programs assume that global variables can be used independently on each
processor, i.e., global variable x on processor 1 can have a different value than
global variable x on processor 2. However, in AMPI, all the threads on one
processor share a single address space and thus a single set of global variables,
and when a thread migrates, it leaves its global variables behind. Another
problem is that global variables shared on the same processor might be changed
by other threads. Therefore, global variables are disallowed in AMPI programs.

1 On a 1.8 GHz AMD AthlonXP, overhead for a suspend/schedule/resume operation
is 0.45 microseconds.

3.3 Converting MPI Programs To AMPI 

If an MPI program uses global variables, it cannot run unmodified under AMPI
and must be converted. As discussed in Section 3.2, for thread
safety, global variables need to be either removed or “privatized”. To remove
the global variables from the code, one can collect all the former globals into
a single structure (an allocated “type” in F90) named, say, “GlobalVars”, which is
then passed into each function.
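
A minimal sketch of this transformation is given below in Python rather than C or Fortran, purely to show the shape of the change. The fields `rank` and `iterations` and the functions `step` and `run_vp` are hypothetical; only the name GlobalVars comes from the text.

```python
from dataclasses import dataclass


@dataclass
class GlobalVars:
    """Holds what used to be global variables; one instance per virtual processor."""
    rank: int
    iterations: int = 0


def step(gv):
    # Formerly this would have updated a global counter shared by all threads
    # on the processor; now it touches only the caller's private structure.
    gv.iterations += 1


def run_vp(rank, steps):
    gv = GlobalVars(rank=rank)  # private state that can travel with the thread
    for _ in range(steps):
        step(gv)
    return gv


if __name__ == "__main__":
    print(run_vp(0, 3), run_vp(1, 5))
```

Because every function receives the structure explicitly, two virtual processors on the same physical processor no longer interfere with each other's state.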

Manually removing all the global variables is sometimes cumbersome,
though mechanical. Fortunately, this can be automated. AMPIzer [17] is our
source-to-source translator based on Polaris [18] that privatizes global variables
in arbitrary Fortran77 or Fortran90 code and generates the necessary code
for moving the data across processors.



4 Features 

This section discusses in detail the key features that help achieve higher
parallel performance and reduce the complexity of parallel programming.

4.1 Automatic Load Balancing 

Achieving automatic dynamic load balancing without introducing excessive
overhead is challenging. Charm++ addresses this issue with
its integrated load balancing strategies, or Load Balancers [13]. The common 
mechanism they share is: during the execution of the program, a load balanc- 
ing framework collects workload information on each physical processor in the 
background, and when the program hands over the control to a load balancer, 
it uses this information to redistribute the workload, and migrate the parallel 
objects between the processors as necessary. 

As there are different answers to the questions of (1) what information to
collect, (2) where the information is processed, and (3) how to design the
redistribution scheme, there are different types of load balancing strategies. For
the first question, some load balancers look at computation workload only, while
others take inter-processor communication into consideration. For the second
question, some load balancers send the information to a central agent in
the system for processing, whereas others only have objects exchange
information with their neighbors and make decisions locally. For the third
question, some load balancers randomly redistribute the workload and hope for
the best, whereas others use deliberate algorithms to determine a new,
better-balanced distribution. For more detail, please refer to [13] and the
Charm++ manuals.
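
To make one point in this design space concrete, the sketch below shows a centralized strategy that considers computation load only and redistributes greedily. It is an illustrative assumption, not the algorithm of any particular Charm++ load balancer; the function name `greedy_rebalance` and the sample loads are hypothetical.

```python
import heapq


def greedy_rebalance(vp_loads, num_procs):
    """Assign VPs to processors, heaviest first, always onto the least-loaded processor.

    vp_loads: measured load of each VP (e.g. compute time in the last phase).
    Returns (assignment, proc_loads), where assignment[vp] is a processor index.
    """
    heap = [(0.0, p) for p in range(num_procs)]  # (current load, processor)
    heapq.heapify(heap)
    assignment = [None] * len(vp_loads)
    for vp in sorted(range(len(vp_loads)), key=lambda v: -vp_loads[v]):
        load, proc = heapq.heappop(heap)
        assignment[vp] = proc
        heapq.heappush(heap, (load + vp_loads[vp], proc))
    proc_loads = [0.0] * num_procs
    for vp, proc in enumerate(assignment):
        proc_loads[proc] += vp_loads[vp]
    return assignment, proc_loads


if __name__ == "__main__":
    # Hypothetical measurements: a few heavy VPs among many light ones.
    loads = [13.0] * 8 + [5.0] * 120
    _, per_proc = greedy_rebalance(loads, 16)
    print(min(per_proc), max(per_proc))
```

A real strategy would also have to weigh the cost of migrating each object, which this sketch ignores.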

A key issue in automatic load balancing is to cleanly move objects from one
processor to another. Charm++ natively supports object migration,
but in the context of AMPI, thread migration required several interesting
additions to the runtime system, as described in the following sections.



Isomalloc Stacks A user-level thread, when suspended, consists of a stack and 
a set of preserved machine registers. During migration, the machine registers are 
simply copied to the new processor. The stack, unfortunately, is very difficult 
to move. In a distributed memory parallel machine, if the stack is moved to 
a new machine, it will almost certainly be allocated at a different location,
so existing pointers to addresses in the original stack would become invalid when 
the stack moves. We cannot reliably update all the pointers to stack-allocated 
variables, because these pointers are stored in machine registers and stack frames, 
whose layout is highly machine- and compiler-dependent. 

Our solution is to ensure that even after a migration, a thread’s stack will stay 
at the same address in memory that it had on the old processor. This means all 
the pointers embedded in the stack will still work properly. Luckily, any operating 
system with virtual memory support has the ability to map arbitrary pages in 
and out of memory. Therefore we merely need to mmap the appropriate address 
range into memory on the new machine and use it for our stack. To ensure 
that each thread allocates its stack at a globally unique range of addresses, the 
available virtual address space is divided into P regions, each for one thread 
respectively. This “isomalloc” approach to thread migration is based on
PM2 [19].
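
The address arithmetic behind such a scheme can be sketched as follows. This is one possible way to carve out globally unique stack ranges, not AMPI's actual layout: the function name `stack_region`, the base address, the region size, and the page size are all assumptions. A real implementation would then map the returned range at that fixed address (for example with mmap) on whichever processor currently hosts the thread.

```python
def stack_region(thread_id, num_threads, base, total_bytes, page=4096):
    """Return the [start, end) virtual address range reserved for one thread's stack.

    The shared range [base, base + total_bytes) is cut into equal, page-aligned
    slots, one per thread, so a thread gets the same addresses on every processor.
    """
    slot = (total_bytes // num_threads) // page * page  # round down to whole pages
    if slot == 0:
        raise ValueError("address range too small for this many threads")
    start = base + thread_id * slot
    return start, start + slot


if __name__ == "__main__":
    # Hypothetical values: 1 GiB of address space shared by 4 threads.
    for t in range(4):
        start, end = stack_region(t, 4, 0x100000000, 1 << 30)
        print(t, hex(start), hex(end))
```

Since the slot depends only on the thread's identifier, no two threads can ever claim overlapping addresses, regardless of where they run.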



Isomalloc Heaps Another obvious problem with migrating an arbitrary pro- 
gram is dynamically allocated storage. Unlike the thread stack, which the system 
allocated, dynamically allocated locations are known only to the user program. 

The “isomalloc” strategy available in the latest version of AMPI uses the 
same virtual address allocation method used for stacks to allocate all heap data. 
Similarly, the user’s heap data is given globally unique virtual addresses, so it can 
be moved to any running processor without changing its address. Thus migra- 
tion is transparent to the user code, even for arbitrarily interlinked, dynamically 
allocated data structures. To do this, AMPI must intercept and handle all mem- 
ory allocations done by the user code. On many UNIX systems, this can be done 
by providing our own implementation of malloc. Machines with 64-bit pointers, 
which are becoming increasingly common, support a large virtual address space 
and hence can fully benefit from isomalloc heaps. 

Limitations During migration, we do not preserve a thread’s open files and
sockets, environment variables, or signals. However, threads are only migrated 
when they call the special API routine MPI_Migrate, so currently the non- 
migration-safe features can be used at any other time. The intention is to sup- 
port these operations via a thread-safe AMPI specific API, which will work with 
migration, in the future. Thread migration between different architectures on 
a heterogeneous parallel machine is also not supported (see footnote 2).

4.2 Collective Communication Optimization 

Collective communications are required in many scientific applications, as they
are used in many basic operations like high-dimensional FFT, LU factorization,
and linear algebra operations. These communications involve many or all
processors in the system, which makes them complex and time-consuming. AMPI
uses the Charm++ communication library [20, 21] to optimize its collective
communication. This library uses two techniques to optimize collective
communications. For small messages, messages are combined and routed
via intermediate processors to reduce the software overhead. For large messages,
network contention, the dominant factor in the total cost, is lowered by smart
sequencing of the messages based on the underlying network topology.
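
To see why combining helps for small messages, consider an all-to-all in which every processor has one message for every other processor. The sketch below counts the messages each processor sends when messages go directly, and when they are combined in a two-phase exchange over a virtual 2D mesh (first along rows, then along columns). The mesh scheme is used here as an assumed example of routing via intermediate processors; the text above does not specify which combining scheme the library applies, and the function name is hypothetical.

```python
import math


def messages_per_processor(num_procs):
    """Messages each processor sends in an all-to-all: direct vs. virtual 2D mesh.

    Assumes num_procs is a perfect square so that the mesh is side x side.
    """
    side = math.isqrt(num_procs)
    if side * side != num_procs:
        raise ValueError("this sketch assumes a square number of processors")
    direct = num_procs - 1  # one message to every other processor
    mesh = 2 * (side - 1)   # row phase, then column phase
    return direct, mesh


if __name__ == "__main__":
    for p in (16, 64, 256):
        print(p, messages_per_processor(p))
```

Each combined message is larger, but when per-message software overhead dominates, sending fewer messages is the better trade.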

Besides the above optimizations inherited from Charm++, AMPI has its own
improvements to the collective communication operations. If we take a closer
look at the time spent on collective communications, only a small portion of the
total time is software overhead, namely the time the CPU spends on communication
operations. In particular, a modern NIC with a communication co-processor performs
message management through remote DMA, so this operation requires very
little CPU intervention. On the other hand, the MPI standard defines collective
operations like MPI_Alltoall and MPI_Allgather to be blocking, wasting CPU
time on waiting for the communication calls to return. To better utilize the
computing power of the CPU, we can make the collective operations non-blocking
to allow useful computation while other MPI processors are waiting for slower
collective operations.

Similar non-blocking collectives were implemented in IBM MPI for AIX [22],
but they were not well benchmarked or documented. Our approach differs from
IBM’s in that we have more flexibility of overlapping, since the light-weight 
threads we use are easier to schedule to make full use of the physical processors. 

4.3 Checkpoint and Restart 

As Stellner describes in his paper on his checkpointing framework [23], process
migration can easily be layered on top of any checkpointing system by simply
rearranging the checkpoint files before restart. AMPI implements checkpointing
in exactly the opposite way. In AMPI, rather than migration being a special kind
of checkpoint/restart, checkpoint/restart is seen as a special kind of migration:
migration to and from the disk.

2 This will require extensive compiler support or a common virtual machine.
Alternatively, stack-copying threads along with user-supplied pack/unpack code can be
used to support AMPI in a heterogeneous environment.

A running AMPI thread checkpoints itself by calling MPI_Checkpoint with 
a directory name. Each thread drains its network queue, migrates a copy of itself 
into a file in that directory, and then continues normally. The checkpoint time 
is dominated by the cost of the I/O, since very little communication is required. 

There are currently two ways to organize the checkpoint files: (1) all threads
on the same physical processor are grouped into a single disk file to reduce the
number of files created, or (2) each thread has its own file. With the second
option, because the AMPI system checkpoints threads rather than physical processors,
an AMPI program may be restored on a larger or smaller number of physical
processors than it was started on. Thus a checkpoint on 1000 processors can
easily be restarted on 999 processors if, for example, a processor fails during the
run.
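
The second option can be pictured with the small sketch below, which treats a checkpoint as "migration to disk" of each virtual processor's state and then restores the same virtual processors onto a different number of physical processors. The file naming, the use of pickle, and the round-robin placement on restart are assumptions for illustration; they do not describe AMPI's file format or its MPI_Checkpoint implementation.

```python
import os
import pickle
import tempfile


def checkpoint(vp_states, directory):
    """Write one file per virtual processor."""
    for vp, state in vp_states.items():
        with open(os.path.join(directory, f"vp{vp}.ckpt"), "wb") as f:
            pickle.dump(state, f)


def restart(directory, num_procs):
    """Reload every VP and deal the VPs out over num_procs processors."""
    procs = [dict() for _ in range(num_procs)]
    names = sorted(n for n in os.listdir(directory) if n.endswith(".ckpt"))
    for i, name in enumerate(names):
        vp = int(name[2:-5])  # strip the "vp" prefix and ".ckpt" suffix
        with open(os.path.join(directory, name), "rb") as f:
            procs[i % num_procs][vp] = pickle.load(f)
    return procs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        states = {vp: {"step": 10, "data": [vp] * 3} for vp in range(8)}
        checkpoint(states, d)       # taken while running on, say, 4 processors
        restored = restart(d, 3)    # restarted on 3 processors
        print([sorted(p) for p in restored])
```

Nothing in the files records which physical processor a thread was on, which is what makes restarting on a different processor count straightforward.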

4.4 Multi-module AMPI 

Large scientific programs are often written in a modular fashion by combining 
multiple MPI modules into a single program. These MPI modules are often 
derived from independent MPI programs. 

Current MPI programs transfer control from one module to another strictly 
via subroutine calls. Even if two modules are independent, idle time in one cannot 
be overlapped with computations in the other without breaking the abstraction 
boundaries between the two modules. In contrast, AMPI allows multiple sep- 
arately developed modules to interleave execution based on the availability of 
messages. Each module may have its own “main”, and its own flow of control. 
AMPI provides cross-communicators to communicate between such modules.

4.5 Shrink-Expand Capability 

AMPI normally migrates virtual processors for load balance, but this capability 
can also be used to respond to the changing properties of the parallel machine. 
For example, Figure 2 shows the conjugate gradient solver responding to the 
availability of several new processors. The time per step drops dramatically as 
virtual processors are migrated onto the new physical processors. 



5 AMPI Benchmarks 

In this section we use several benchmarks to illustrate the performance
improvements that AMPI is capable of. One of the basic benchmarks here is a
2D grid-based stencil-type calculation. It is a multiple timestepping calculation
involving a group of objects in a mesh. At each timestep, every object exchanges
part of its data with its neighbors and does some computation based on the
neighbors’ data. The objects can be organized in a 2D or 3D mesh, and 1-away
or 2-away neighbors may be involved. Depending on these choices, the
number of points in the stencil computation can range from 5 to 13. Although this
is a simplified model of many applications, such as fluid dynamics or heat dispersion
simulations, it serves well for the purpose of demonstration. We have chosen
Lemieux, the supercomputer at the Pittsburgh Supercomputing Center [24], as the
major benchmark platform.

Fig. 2. Time per step for the million-row conjugate gradient solver on a workstation
cluster. Initially, the application runs on 16 machines. 16 new machines are made
available at step 600, which immediately improves the throughput (x axis: timestep number)

5.1 Adaptive Overlapping 

In Adaptive MPI, Virtual Processors are message-driven objects mapped onto 
physical processors. Several VPs can be mapped onto one physical processor, and 
the message passing among VPs is really communication between these objects. 

We have explained this in Section 3.1. Now we will show the first benefit of
virtualization: adaptive overlapping of computation with communication, which
can improve the utilization of CPUs.

Figures 3 and 4 are timelines from Projections, the visualization tool for
Charm++ (see footnote 3). In the timelines, the x direction is time and the
y direction shows 8 physical processors. For each processor, a solid block means
it is in use, while a gap between blocks is idle time. The figures shown are from
2 separate runs of a 2D 5-point stencil calculation. In the first run, only one
VP is created on each physical processor, so there is no virtualization. In the
second run, 8 VPs are created on each physical processor, with each VP doing a
smaller share of the computation; the total problem size is the same. In the
displayed portion of the execution time, Figure 3 shows obvious gaps between
blocks, and the overall utilization is around 70%. This illustrates the CPU time
wasted while waiting for blocking communication to return. In Figure 4, however,
the gaps of communication are filled with smaller chunks of computation: when one
object is waiting for its communication to return, other objects on the processor
can automatically take over and do their computation, eliminating the need
for manual arrangement. With the adaptive overlapping of communication and
computation, the average utilization of CPU is boosted to around 80%.

Fig. 3. Timeline of 1024^2 2D 5-point stencil calculation on Lemieux. No virtualization
is used in this case: one VP per processor

Fig. 4. Timeline of 1024^2 2D 5-point stencil calculation on Lemieux. Virtualization
ratio is 8: eight VPs created on each processor

3 Manual available at http://finesse.cs.uiuc.edu/manuals/

5.2 Automatic Load Balancing 

In parallel programming, load imbalance must be carefully avoided.
Unfortunately, load imbalance, especially dynamic load imbalance, appears frequently
and is difficult to remove. For instance, consider a simulation on a mesh where
part of the mesh has a more complicated structure than the rest, so
that the load within the mesh is imbalanced. As another example, when adaptive
mesh refinement (AMR) is in use, hot spots can arise where the mesh structure
is highly refined. This dynamic type of load imbalance requires more
programmer/system intervention to remove. AMPI, using the automatic load balancing
mechanism integrated in the Charm++ system, removes
static and dynamic load imbalance automatically.

As a simple benchmark, we modified the 5-point stencil program by dividing
the mesh in a 2D stencil calculation into two parts: in the first 1/16 of the
mesh, all objects do 2-away (13-point) calculation, while the rest do 1-away
(5-point) calculation. The load on 1/16 of the processors is thus much heavier
than that on the remaining 15/16. The program used 128 AMPI VPs on 16 processors.

Fig. 5. Utilization of 16 processors before (left) and after (right) automatic load
balancing in a non-uniform stencil calculation

Fig. 6. Overall CPU utilization before and after automatic load balancing in a
non-uniform stencil calculation

Although it is an artificial benchmark, it represents a common situation: a
very small fraction of overloaded processors can ruin the overall
performance of all processors. The load balancer is employed to solve this problem, as
shown in Figures 5 and 6. According to Figure 5, one of the 16 processors is
overloaded while the others are underloaded, with average utilization less than 60%
before load balancing; after load balancing, the variation of the workload is
diminished and the overall utilization is about 20% higher. Correspondingly, the
average time per iteration drops from 1.15 ms to 0.85 ms. Figure 6 demonstrates
how utilization increases from approximately 55% to 85% once the load balancer
is activated. Note that this load balancing is done automatically by the
system; no programmer intervention is needed.



5.3 Collective Communication Optimization 

The MPI standard defines the collective operations as blocking, which makes it
impossible to overlap them with computation, because many or all processors are
blocked waiting for the collective operation to return. In Section 4.2 we discussed
the optimization of supporting non-blocking collective operations to allow
overlapping. Now we illustrate how this feature can save execution time in parallel
applications.

Fig. 7. Breakdown of execution time of 2D FFT benchmark on 4, 8, and 16 processors,
with comparison between blocking (MPI) and non-blocking (AMPI) all-to-all operations

In [25], a parallel algorithm for Quantum Molecular Dynamics is discussed.
One complexity in the algorithm arises from 128 independent and concurrent
sets of 3D FFTs. Although each of the FFTs can be parallelized, overlapping
between different sets of FFTs is difficult due to the all-to-all operation required
for transposing data in each FFT. However, AMPI’s non-blocking all-to-all
operation allows the programmer to overlap the communication and computation
from consecutive sets of FFTs and save execution time.

To make a benchmark based on this application, we simplified the above
problem. We do two independent sets of 2D FFTs, each consisting of one 1D
FFT, a transpose, and another 1D FFT. To pipeline the operations, we move the
second 1D FFT of the first set after the transpose of the first set. In the blocking
version, however, this pipelining gains no performance, because the
transpose, implemented as a blocking all-to-all communication, stops any other
computation from being done. In the non-blocking version, the second set is
able to do real computation while the first set is waiting for its communication
to complete.
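
The effect can be estimated with a simple timing model, sketched below. The model is an assumption made for illustration: it supposes that a transpose costs the CPU nothing once started, that the two transposes can be in flight at the same time, and that the times are arbitrary units rather than measurements. The function names are hypothetical.

```python
def blocking_time(fft, transpose, sets=2):
    """Each set runs FFT, transpose, FFT with the CPU idle during the transpose."""
    return sets * (2 * fft + transpose)


def nonblocking_time(fft, transpose):
    """Two sets; the second set's first FFT overlaps the first set's transpose."""
    a_transpose_done = fft + transpose   # set A: first FFT, then start its transpose
    b_fft1_done = 2 * fft                # set B computes while A waits
    b_transpose_done = b_fft1_done + transpose
    a_done = max(b_fft1_done, a_transpose_done) + fft  # A's second FFT
    return max(a_done, b_transpose_done) + fft         # B's second FFT


if __name__ == "__main__":
    # Hypothetical costs: 3 time units per 1D FFT phase, 4 per transpose.
    print(blocking_time(3, 4), nonblocking_time(3, 4))
```

In this idealized model the saving grows with the amount of computation available to hide the transpose behind; the measured savings reported for the benchmark are smaller than the model suggests, since real communication also consumes some CPU time.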

Figure 7 demonstrates the effect of overlapping collective communications
with computation. The y axis shows different numbers of processors, for the blocking
version (labeled MPI) and the non-blocking version (labeled AMPI) respectively,
and the x axis is the execution time. Using distinct colors in the stacked bars, we
denote the breakdown of the overhead into 1D FFT (computation), communication,
and, for the non-blocking version, the waiting time for the non-blocking operation,
as discussed in Section 4.2.

It can be observed that the two versions have similar amounts of computation, 
but in terms of communication, the non-blocking version has an advantage because 
part of its waiting time is hidden by overlapping it with computation. The AMPI 
bar is 10% to 20% shorter than the MPI bar, the amount of saving depending on 
the amount of possible overlap. This saving could be even larger if there were more 
computation to overlap. 

Table 1. Execution time [ms] of 240^3 3D 7-point stencil calculation on Lemieux 

# Procs   Native MPI   AMPI(1)    AMPI(K) 
8         -            318.488    104.909 
27        29.440       41.415     28.166 
64        14.162       16.433     12.670 
125       9.121        11.504     11.590 
216       8.066        6.506*     8.365 
512       5.519        6.486      5.645 
1728      4.499        3.521*     - 


5.4 Flexibility and Overhead 

In this section we show the flexibility that virtualization provides, as well 
as the overhead it incurs. Our benchmark is a 240^3 3D 7-point stencil 
calculation. 

First we run it with native MPI on Lemieux. Because the model of the 
program divides the job into K-cubed partitions, not surprisingly, the program 
runs only on a cube number of processors. On Adaptive MPI with virtualization, 
the program runs transparently on any given number of processors, exhibiting 
the flexibility that virtualization offers. The comparison between these two runs 
is shown in Table 2. The performance on native MPI and on Adaptive 
MPI differs very little. Note that on some “random” numbers 
of PEs, like 19 and 140, the native MPI program is not able to run, while AMPI 
handles the situation without difficulty. 
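
The cube-number restriction can be made concrete with a small check. The following is only an illustrative sketch with hypothetical names; it simply tests whether a processor count is a perfect cube, which is the condition under which the native MPI version of the stencil program can run.

```java
// Checks whether a processor count is a perfect cube (K*K*K), the only
// counts on which the K-cubed partitioning of the native MPI program works.
final class CubeCheck {
    static boolean isPerfectCube(int processors) {
        if (processors < 1) {
            return false;
        }
        // Round the real cube root to the nearest integer, then verify exactly.
        long k = Math.round(Math.cbrt(processors));
        return k * k * k == processors;
    }

    public static void main(String[] args) {
        int[] peCounts = {19, 27, 33, 64, 80, 105, 125, 140, 175, 216, 250, 512};
        for (int pe : peCounts) {
            System.out.println(pe + " -> " + (isPerfectCube(pe) ? "native MPI can run" : "N/A"));
        }
    }
}
```

Applied to the PE counts of Table 2, the check accepts exactly the counts for which a native MPI time is reported.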

Now let’s take a closer look at the speedup data of the same program running 
on native MPI, AMPI with 1 VP per processor, and AMPI with multiple (K = 4 to 
10) VPs per processor. Table 1 displays the execution time of the same size 
problem running on an increasing number of processors, with the results for the 
best K values shown in the AMPI(K) column. 

Comparing the execution time of native MPI against AMPI, we find that 
although native MPI outperforms AMPI in many cases, as expected, it does so 
by only a small amount. Thus, the flexibility and load balancing advantages of 
AMPI do not come at an undue price in basic performance (see footnote 4). In 
some cases, nevertheless, AMPI does a little better. For example, AMPI(K) is 
faster than native MPI when the number of processors is small. This is due to 
the caching effect; many VPs grouped on one processor will increase the locality 
of data as well as instructions. The advantage of this caching effect is shown in 
Table 1, where AMPI with virtualization outperforms AMPI(1) on smaller numbers 
of processors. When there are many processors involved, the cost of coordinating 
the VPs takes over and offsets the caching effect. Two results (marked by * in 
Table 1) are anomalous, and we have not identified the underlying causes yet. 

4 A microbenchmark shows an average of 2 μs for a context switch between the threads 
with which AMPI VPs are associated, on a 400 MHz PIII Xeon processor. 

Table 2. Execution time [ms] of AMPI vs. native MPI, of 240^3 3D 7-point stencil 
calculation on Lemieux 

#PE    Native MPI   AMPI 
19     N/A          42.410 
27     29.440       30.528 
33     N/A          24.646 
64     14.162       15.635 
80     N/A          12.621 
105    N/A          10.935 
125    9.121        10.776 
140    N/A          10.616 
175    N/A          9.388 
216    8.066        8.626 
250    N/A          7.549 
512    5.519        5.464 

6 AMPI Experience: Rocket Simulation 

The Center for Simulation of Advanced Rockets (CSAR) is an academic re- 
search organization funded by the Department of Energy and affiliated with the 
University of Illinois. The focus of CSAR is the accurate physical simulation 
of solid-propellant rockets, such as the Space Shuttle’s solid rocket boosters. 
CSAR consists of several dozen faculty from ten different engineering and sci- 
ence departments, as well as 18 professional staff. The main CSAR simulation 
code consists of four major components: a fluid dynamics simulation, for the 
hot gas flowing through and out of the rocket; a surface burning model for the 
solid propellant; a nonmatching but fully-coupled fluid/solid interface; and fi- 
nally a finite-element solid mechanics simulation for the solid propellant and 
rocket casing. Each one of these components - fluids, burning, interface, and 
solids - began as an independently developed parallel MPI program. 

One of the most important early benefits CSAR found in using AMPI is the 
ability to run a partitioned set of input files on a different number of virtual 
processors than physical processors. For example, a CSAR developer was faced 
with an error in mesh motion that only appeared when a particular problem 
was partitioned for 480 processors. Finding and fixing the error was difficult, 
because a job for 480 physical processors can only be run after a long wait in the 
batch queue at a supercomputer center. Using AMPI, the developer was able 
to debug the problem interactively, using 480 virtual processors distributed over 
32 physical processors of a local cluster, which made resolving the error much 
faster and easier. 

Because each of the CSAR simulation components is developed independently, 
and each has its own parallel input format, there are difficult practical 
problems involved in simply preparing input meshes that are partitioned for the 
correct number of physical processors available. Using AMPI, CSAR developers 
often simply use a fixed number of virtual processors, which allows a wide range 
of physical processors to be used without repartitioning the problem’s input files. 

As the solid propellant burns away, each processor’s portion of the problem 
domain changes, which will change the CPU and communication time required 
by that processor. The most important long-term benefit that the CSAR codes 
will derive from AMPI is the ability to adapt to this changing computation by 
migrating work between processors, taking advantage of the Charm++ load 
balancing framework’s demonstrated ability to optimize for load balance and 
communication efficiency. Because the CSAR components do not yet change 
the mesh structure during a run, and merely distort the existing mesh, the 
computation and communication patterns of the virtual MPI processors do not 
yet change. However, this mesh distortion breaks down after a relatively small 
amount of motion, so the ability to adjust the mesh to the changing problem 
domain is scheduled to be added soon. 

Finally, the CSAR simulator’s current main loop consists of one call to each 
of the simulation components in turn, in a one-at-a-time lockstep fashion. This 
means, for example, the fluid simulation must finish its timestep before the solids 
can begin its own. But because each component runs independently except at 
well-defined interface points, and AMPI allows multiple independent threads of 
execution, we will be able to improve performance by splitting the main loop into 
a set of cooperating threads. This would allow, for example, the fluid simulation 
thread to use the processor while the solid thread is blocked waiting for remote 
data or a solids synchronization. Separating each component should also improve 
our ability to optimize the communication balance across the machine, since 
currently the i’th fluids processor has no physical correspondence with the i’th 
solids processor. 

In summary, AMPI has proven a useful tool for the CSAR simulation, from 
debugging to day-to-day operations to future plans. 

7 Conclusions 

We have presented AMPI, an adaptive implementation of MPI on top of 
Charm++. AMPI implements migratable, virtual, light-weight MPI processors. 
It assigns several virtual processors to each physical processor. This 
efficient virtualization provides a number of benefits, such as the ability to au- 
tomatically load balance arbitrary computations, automatically overlap compu- 
tation and communication, emulate large machines on small ones, and respond 
to a changing physical machine. Several applications are being developed using 
AMPI, including those in rocket simulation. 

AMPI is an active research project; much future work is planned for AMPI. 
We expect to achieve full MPI-1.1 standards conformance soon, and MPI-2 
thereafter. We are rapidly improving the performance of AMPI, and it should soon 
be quite near that of non-migratable MPI. The Charm++ performance analysis 
tools are being updated to provide more direct support for AMPI programs. 
Finally, we plan to extend our suite of automatic load balancing strategies to 
provide machine-topology specific strategies, useful for future machines such as 
BlueGene/L. 

Abstract. We explore advances in Java Virtual Machine (JVM) technology 
along with new high performance I/O libraries in Java 1.4, and find that 
Java is increasingly an attractive platform for scientific cluster-based 
message passing codes. 

We report that these new technologies allow a pure Java implementa- 
tion of a cluster communication library that performs competitively with 
standard C-based MPI implementations. 



1 Introduction 

Previous efforts at Java-based message-passing frameworks have focused on mak- 
ing the functionality of the Message Passing Interface (MPI) [1] available in 
Java, either through native code wrappers to existing MPI libraries (mpiJava [2], 
JavaMPI [3]) or pure Java implementations (MPIJ [4]). Previous work showed 
that both pure Java and Java/native MPI hybrid approaches offered substan- 
tially worse performance than MPI applications written in C or Fortran with 
MPI bindings. 

We have built Message Passing Java, or MPJava, a pure-Java message passing 
framework. We make extensive use of the java.nio package introduced in 
Java 1.4. Currently, our framework provides a subset of the functionality 
available in MPI. MPJava does not use the Java Native Interface (JNI). The JNI, 
while convenient and occasionally necessary, violates type safety, incurs a 
performance penalty due to additional data copies between the Java and C heaps, and 
prevents the JVM’s Just-In-Time (JIT) compiler from fully optimizing methods 
that make native calls. 

MPJava offers promising results for the future of high performance message 
passing in pure Java. On a cluster of Linux workstations, MPJava provides per- 
formance that is competitive with LAM-MPI [5] for the Java Grande Forum’s 
Ping-Pong and All-to-All microbenchmarks. Our framework also provides perfor- 
mance that is comparable to the Fortran/LAM-MPI implementation of a Con- 
jugate Gradient benchmark taken from the NASA Advanced Supercomputing 
Parallel Benchmarks (NAS PB) benchmark suite. 

2 Design and Implementation 

We have designed MPJava as an MPI-like message passing library implemented 
in pure Java, making use of the improved I/O capabilities of the java.nio pack- 
age. MPJava adheres to the Single Program Multiple Data (SPMD) model used 
by MPI. Each MPJava instance knows how many total nodes are in use for the 
computation, as well as its own unique processor identification tag (PID). Us- 
ing this information, the programmer can decide how to split up shared data. 
For example, if 10 nodes are being used for one MPJava computation, a shared 
array with 100 elements can store elements 0-9 on node 0, 10-19 on node 1, etc. 
Data can be exchanged between nodes using point-to-point send() and recv() 
operations, or with collective communications such as all-to-all broadcast. Dis- 
tributing data in this manner and using communication routines is typical of 
MPI, OpenMP, and other parallel programming paradigms. 



2.1 Functionality 

The MPJava API provides point-to-point send() and recv() functions: 

    send( int peer, int offset, int len, double[] arr ) 
    recv( int peer, int offset, int len, double[] arr ) 

These high-level functions abstract away the messy details related to TCP, 
allowing the user to focus on the application rather than the message passing 
details. 

MPJava also provides a subset of the collective communication operations 
typically available in a message passing library such as MPI. For example, if an 
array with 100 elements is distributed between 10 nodes, an all-to-all broadcast 
routine can be used to recreate the entire array of 100 elements on each node: 

    alltoallBroadcast( double[] arr, int distribution ) 

The distribution parameter is a constant that tells MPJava how the data 
is distributed between nodes. The default setting we use is as follows: an 
array with n elements will be split between p nodes with each node holding n/p 
elements and the last node holding n/p + (n mod p) elements. Other distribution 
patterns are possible, though we employ the simple default setting for our 
experiments. 
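
The default distribution can be sketched as follows. This is a minimal illustration of the rule stated above, not MPJava's own code; the class and method names are hypothetical, and n/p is taken to be integer division.

```java
// Default block distribution: every node holds n/p elements (integer
// division) and the last node additionally holds the n mod p leftovers.
final class BlockDistribution {
    /** Number of elements stored on node pid. */
    static int count(int n, int p, int pid) {
        checkArgs(n, p, pid);
        int base = n / p;
        return (pid == p - 1) ? base + n % p : base;
    }

    /** Index of the first element stored on node pid. */
    static int offset(int n, int p, int pid) {
        checkArgs(n, p, pid);
        return pid * (n / p);
    }

    private static void checkArgs(int n, int p, int pid) {
        if (n < 0 || p < 1 || pid < 0 || pid >= p) {
            throw new IllegalArgumentException("need n >= 0, p >= 1, 0 <= pid < p");
        }
    }

    public static void main(String[] args) {
        // 100 elements on 10 nodes: node 1 holds elements 10-19.
        System.out.println(offset(100, 10, 1) + " .. "
                + (offset(100, 10, 1) + count(100, 10, 1) - 1));
    }
}
```

With 103 elements on 10 nodes, nodes 0 to 8 hold 10 elements each and node 9 holds the remaining 13.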

2.2 Bootstrapping 

MPJava provides a series of start-up scripts that read a list of hostnames, per- 
form the necessary remote logins to each machine, and start MPJava processes 
on each machine with special arguments that allow each MPJava process to find 
the others. The result of the bootstrap process is a network of MPJava pro- 
cesses where each process has TCP connections to every other process in the 
network. These TCP connections are used by the nodes for point-to-point as 
well as collective communications. 

2.3 Collective Communication Algorithms 

We explored two different all-to-all broadcast algorithms: a multi-threaded con- 
current algorithm in which all pairs of nodes exchange data in parallel, and 
a parallel prefix algorithm that only uses a single thread. 

In the concurrent algorithm, each node has a separate send and receive 
thread, and the select() mechanism is used to multiplex communication to 
all the other processors. 

In the parallel prefix implementation, data exchange proceeds in $\log_2(n)$ 
rounds, sending $2^{r-1}$ pieces of data in each round, where $r$ is the current 
round number. For example, if there were 16 total nodes, node 0 would broadcast 
according to the following schedule: 

Example broadcast schedule for node 0 with 16 total nodes 

round   partner   data 
1       1         0 
2       2         0,1 
3       4         0-3 
4       8         0-7 
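
The schedule for node 0 can be reproduced with a short sketch. The generalization to other nodes shown here (partner chosen by flipping one bit of the node number, as in recursive doubling) is an assumption on our part, since only the schedule for node 0 is given. The names are hypothetical and the node count is assumed to be a power of two.

```java
import java.util.ArrayList;
import java.util.List;

// Generates a per-node exchange schedule in the style of the table above.
final class PrefixSchedule {
    /**
     * Returns one int[] per round: {round, partner, firstPiece, lastPiece}.
     * In round r the node exchanges the 2^(r-1) pieces it already holds.
     */
    static List<int[]> schedule(int pid, int nodes) {
        if (nodes < 1 || Integer.bitCount(nodes) != 1) {
            throw new IllegalArgumentException("nodes must be a power of two");
        }
        if (pid < 0 || pid >= nodes) {
            throw new IllegalArgumentException("pid out of range");
        }
        List<int[]> steps = new ArrayList<>();
        int round = 1;
        for (int span = 1; span < nodes; span <<= 1) {
            int partner = pid ^ span;          // flip bit (round - 1) of the node number
            int first = pid & ~(span - 1);     // start of the block of pieces held so far
            steps.add(new int[] {round, partner, first, first + span - 1});
            round++;
        }
        return steps;
    }

    public static void main(String[] args) {
        for (int[] s : schedule(0, 16)) {
            System.out.println("round " + s[0] + ": partner " + s[1]
                    + ", pieces " + s[2] + "-" + s[3]);
        }
    }
}
```

For node 0 with 16 nodes this yields four rounds with partners 1, 2, 4, and 8, matching the table.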



3 Introduction to Java.nio 

Java’s New I/O APIs (java.nio) are defined in Java Specification Request 
(JSR) 51 [6]. These New I/O, or NIO, libraries were heavily influenced by, and 
address a number of issues exposed by, the pioneering work of Matt Welsh et al. 
on JAGUAR [7] and Chi-Chao Chang and Thorsten von Eicken on JAVIA [8]. 

3.1 Inefficiencies of Java.io and Java.net 

The original java.io and java.net libraries available prior to JDK 1.4 perform 
well enough for client-server codes based on Remote Method Invocation (RMI) in 
a WAN environment. The performance of these libraries is not suitable, however, 
for high-performance communication in a LAN environment due to several key 
inefficiencies in their design: 

— Under the java.io libraries, the process of converting between bytes and 
other primitive types (such as doubles) is inefficient. First, a native method 
is used that allows a double to be treated as a 64-bit long integer (the JNI 
is required because type coercions from double to long are not allowed under 
Java’s strong type system). Next, bit-shifts and bit-masks are used to strip 
8-bit segments from the 64-bit integer, then write these 8-bit segments into 
a byte array. 

java.nio buffers allow direct copies of doubles and other values to/from 
buffers, and also support bulk operations for copying between Java arrays 
and java.nio buffers. 

— The java.io operations work out of an array of bytes allocated in the Java 
heap. Java cannot pass references to arrays allocated in the Java heap to 
system-level I/O operations, because objects in the Java heap can be moved 
by the garbage collector. 

Instead, another array must be allocated in the C heap and the data must 
be copied back and forth. Alternatively, to avoid this extra overhead, some 
JVM implementations “pin” the byte array in the Java heap during I/O 
operations. 

java.nio buffers can be allocated as DirectBuffers, which are allocated 
in the C heap and therefore not subject to garbage collection. This allows 
I/O operations with no more copying than what is required by the operating 
system for any programming language. 

— Prior to NIO, Java lacked a way for a single thread to poll multiple sock- 
ets, and the ability to make non-blocking I/O requests on a socket. The 
workaround solution of using a separate thread to poll each socket intro- 
duces unacceptable overhead for a high performance application, and simply 
does not scale well as the number of sockets increases. 

java.nio adds a Unix-like select() mechanism in addition to non-blocking 
sockets. 
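
The conversion overhead described in the first item of the list above can be illustrated by placing the two styles side by side. This is a sketch with hypothetical names, not code from either library. It uses the standard Double.doubleToLongBits method for the reinterpretation step and relies on the big-endian default order of a ByteBuffer.

```java
import java.nio.ByteBuffer;

// Two ways to turn one double into eight bytes.
final class DoubleToBytes {
    /** Old style: reinterpret as a long, then strip 8-bit segments by hand. */
    static byte[] viaShifts(double value) {
        long bits = Double.doubleToLongBits(value);
        byte[] out = new byte[8];
        for (int i = 0; i < 8; i++) {
            // Most significant byte first, i.e. big-endian order.
            out[i] = (byte) ((bits >>> (56 - 8 * i)) & 0xFF);
        }
        return out;
    }

    /** NIO style: the buffer stores the double directly (big-endian by default). */
    static byte[] viaBuffer(double value) {
        ByteBuffer buf = ByteBuffer.allocate(8);
        buf.putDouble(value);
        return buf.array();
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(viaShifts(3.14)));
        System.out.println(java.util.Arrays.toString(viaBuffer(3.14)));
    }
}
```

Both methods produce the same eight bytes for ordinary values; the difference lies in how much per-element work the application code has to perform.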

3.2 Java.nio for High-Performance Computing 

MPJava demonstrates that Java can deliver performance competitive with MPI 
for message-passing applications. To maximize the performance of our frame- 
work we have made careful use of several java.nio features critical for high- 
performance computing: channels, select (), and buffers. 

Channels such as SocketChannel are a new abstraction for TCP sockets 
that complement the Socket class available in java.net. The major differences 
between channels and sockets are that channels allow non-blocking I/O calls, 
can be polled and selected by calls to java.nio’s select() mechanism, and 
operate on java.nio.ByteBuffers rather than byte arrays. In general, channels 
are more efficient than sockets, and their use, as well as the use of select(), is 
fairly simple, fulfills an obvious need, and is self-explanatory. 
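
A minimal sketch of the select() pattern is given below. To keep it runnable on a single machine without a network, it uses java.nio Pipe channels in place of the SocketChannels that a message-passing library would register; that substitution and all names are ours.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// One thread watches several non-blocking channels through one Selector.
final class SelectSketch {
    /** Registers two channels, makes one readable, and reads from the ready set. */
    static int bytesReadFromReadyChannels() {
        try (Selector selector = Selector.open()) {
            Pipe idle = Pipe.open();
            Pipe busy = Pipe.open();
            try {
                for (Pipe pipe : new Pipe[] {idle, busy}) {
                    pipe.source().configureBlocking(false);
                    pipe.source().register(selector, SelectionKey.OP_READ);
                }
                // Only the "busy" pipe receives data.
                busy.sink().write(ByteBuffer.wrap(new byte[] {1, 2, 3}));

                int total = 0;
                selector.select();  // blocks until at least one channel is readable
                for (SelectionKey key : selector.selectedKeys()) {
                    ByteBuffer in = ByteBuffer.allocate(16);
                    int n = ((ReadableByteChannel) key.channel()).read(in);
                    if (n > 0) {
                        total += n;
                    }
                }
                return total;
            } finally {
                idle.sink().close();
                idle.source().close();
                busy.sink().close();
                busy.source().close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("bytes read: " + bytesReadFromReadyChannels());
    }
}
```

The same register-then-select loop applies to SocketChannel, which is also a selectable channel.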

The use of java.nio.Buffers, on the other hand, is slightly more complicated, 
and we have found that careful use of buffers is necessary to ensure 
maximal performance of MPJava. We detail some of our experiences with buffers 
below. 

One useful new abstraction provided by NIO is a Buffer, which is a container 
for a primitive type. Buffers maintain a position, and provide relative put() and 
get() methods that operate on the element specified by the current position. 
In addition, buffers provide absolute put(int index, byte val) and get(int 
index) methods that operate on the element specified by the additional index 
parameter, as well as bulk put() and get() methods that transfer a range of 
elements between arrays or other buffers. 
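
The difference between relative and absolute operations is easiest to see in a few lines of code. The following sketch uses only standard ByteBuffer methods; the class name and the values written are arbitrary.

```java
import java.nio.ByteBuffer;

// Relative operations advance the position; absolute operations do not.
final class BufferPositions {
    /** Returns {position after the puts, first byte read, second byte read, final position}. */
    static int[] demo() {
        ByteBuffer buf = ByteBuffer.allocate(4);
        buf.put((byte) 10);        // relative: writes index 0, position becomes 1
        buf.put((byte) 20);        // relative: writes index 1, position becomes 2
        buf.put(3, (byte) 40);     // absolute: writes index 3, position stays 2
        int positionAfterPuts = buf.position();

        buf.flip();                // limit = 2, position = 0: ready for reading
        byte first = buf.get();    // relative: reads index 0, position becomes 1
        byte second = buf.get(1);  // absolute: reads index 1, position stays 1
        return new int[] {positionAfterPuts, first, second, buf.position()};
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(demo()));
    }
}
```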

ByteBuffers allocated as DirectByteBuffers will use a backing store allocated 
from the C heap that is not subject to relocation by the garbage collector. 
The DirectByteBuffer can therefore be passed directly as an argument to 
system-level calls with no additional copying required by the JVM. 

Because direct buffers are expensive to allocate and garbage collect, we pre- 
allocate all of the required buffers. The user sees a collective communication API 
call much like an MPI function; our framework handles behind-the-scenes any 
necessary copies of data between the user’s arrays and our pre-allocated direct 
buffers. 

For the best performance, it is important to ensure that all buffers are set to 
the native endianness of the hardware. Because Java is platform independent, 
it is possible to create buffers that use either big-endian or little-endian formats 
for storing multi-byte values. Furthermore, the default byte order of all buffers is 
big-endian, regardless of the native byte order of the machine where the JVM is 
executing. To communicate among a set of heterogeneous platforms with mixed 
byte orders, one would need to perform some extra bookkeeping, and weather 
some performance overhead in the process. In our experience, this has never 
been an issue, as most clusters consist of a homogeneous set of machines. 
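
Setting the byte order takes one call on a freshly allocated buffer. The helper below is a hypothetical sketch of how a pre-allocated direct buffer might be created; it uses only the standard allocateDirect, order, and ByteOrder.nativeOrder calls.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Allocates a direct buffer and switches it from the big-endian default
// to the byte order of the machine the JVM is running on.
final class NativeOrderBuffers {
    static ByteBuffer allocateNative(int capacityBytes) {
        ByteBuffer buf = ByteBuffer.allocateDirect(capacityBytes);
        buf.order(ByteOrder.nativeOrder());
        return buf;
    }

    public static void main(String[] args) {
        System.out.println("default order: " + ByteBuffer.allocateDirect(8).order());
        System.out.println("native order:  " + allocateNative(8).order());
    }
}
```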

ByteBuffer provides an asDoubleBuffer() method that returns 
a DoubleBuffer, which is a “view” of the chunk of backing data that is 
shared with the ByteBuffer. Maintaining multiple “views” of the same piece 
of data is important for three reasons: First, while ByteBuffer supports 
operations to read or write other primitive types such as doubles or longs, 
each operation requires checks for alignment and endianness in addition to 
the bounds checks typical of Java. Next, ByteBuffer does not provide bulk 
transfer operations for other primitive types. Finally, all socket I/O calls require 
ByteBuffer parameters. NIO solves all of these issues with multiple views: 
DoubleBuffer provides bulk transfer methods for doubles that do not require 
checks for alignment and endianness. Furthermore, these transfers are visible 
to the ByteBuffer “view” without the need for expensive conversions, since the 
ByteBuffer shares the same backing storage as the DoubleBuffer. 

Maintaining two views of each buffer is cumbersome but manageable. We map 
each DoubleBuffer to its corresponding ByteBuffer with an IdentityHashMap 
and take care when changing the position of one of these buffers, as changes to 
the position of one buffer are not visible to other “views” of the same backing 
data. Furthermore, we are careful to prevent the overlap of simultaneous I/O 
calls on the same chunk of backing data, as the resulting race condition leads to 
nasty, unpredictable bugs. 
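
The two-view arrangement, and the independence of the two positions, can be shown with a short sketch. The names below are hypothetical; the calls (allocateDirect, order, asDoubleBuffer, the bulk put, and the absolute getDouble) are standard java.nio methods. Note that the byte order is set before the view is created, because a view takes its byte order from the ByteBuffer at creation time.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;

// A ByteBuffer and a DoubleBuffer view sharing one block of backing storage.
final class BufferViews {
    /** Bulk-writes through the double view, then reads one value back through the byte view. */
    static double writeViaViewReadViaBytes(double[] data, int index) {
        ByteBuffer bytes = ByteBuffer.allocateDirect(data.length * Double.BYTES)
                .order(ByteOrder.nativeOrder());       // set order before creating the view
        DoubleBuffer doubles = bytes.asDoubleBuffer();
        doubles.put(data);                             // bulk transfer into the shared storage
        return bytes.getDouble(index * Double.BYTES);  // absolute read of the same storage
    }

    /** Returns {byte view position, double view position} after a bulk put of n doubles. */
    static int[] positionsAfterBulkPut(int n) {
        ByteBuffer bytes = ByteBuffer.allocateDirect(n * Double.BYTES)
                .order(ByteOrder.nativeOrder());
        DoubleBuffer doubles = bytes.asDoubleBuffer();
        doubles.put(new double[n]);
        // Only the view that performed the put has moved.
        return new int[] {bytes.position(), doubles.position()};
    }

    public static void main(String[] args) {
        System.out.println(writeViaViewReadViaBytes(new double[] {1.5, 2.5, 3.5}, 2));
        System.out.println(java.util.Arrays.toString(positionsAfterBulkPut(4)));
    }
}
```

Because the ByteBuffer's position and limit are unaffected by the bulk put, they have to be set explicitly before the ByteBuffer is handed to a channel write.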

The MPJava API calls in our framework take normal Java arrays as parameters. 
This approach requires that data be copied from arrays into buffers before 
the data can be passed to system-level OS calls. To avoid these extra copy 
operations, we initially implemented our framework with an eye towards performing 
all calculations directly in buffers. Not only does this strategy require a more 
complicated syntax (buffers must be manipulated via put() and get() methods 
rather than the cleaner square bracket notation used with Java arrays), but 
code that repeatedly calls put() and get() on a buffer can be as much as an 
order of magnitude slower than similar code that uses Java arrays. It 
turns out that the cost of copying large amounts of data from arrays into buffers 
before every send (and from buffers to arrays after each receive) is less than the 
cost of the put() and get() methods required to perform computations in the 
buffers. 



4 Performance Results 

We conducted these experiments on a cluster of Pentium III 650 MHz machines 
with 768 MB RAM, running Red Hat Linux 7.3. They are connected by 
two channel-bonded 100 Mbps links through a Cisco 4000 switch capable of 
switching at a maximum of 45 million packets/s or 64 GB/s. We compared Fortran 
codes compiled with g77-2.96 and linked with LAM-MPI 6.5.8 against MPJava 
compiled with JDK-1.4.2-b04 and mpiJava 1.2.3 linked with mpich 1.2.4. 

We use mpich as the underlying MPI implementation for mpiJava because 
mpiJava supports mpich but not LAM. We chose LAM over mpich for our other 
experiments because LAM (designed for performance) delivers better perfor- 
mance than mpich (designed primarily for portability). 



4.1 Ping-Pong 

First we compare our MPJava framework with LAM-MPI and java.io for 
a ping-pong benchmark. The benchmark, based on the Java Grande Forum’s 
ping-pong benchmark, measures the maximum sustainable throughput between 
two nodes by copying data from an array of doubles on one processor into an 
array of doubles on the other processor and back again. The results are given 
in Figure 1. The horizontal axis represents the number of doubles swapped be- 
tween each pair of nodes. To avoid any performance anomalies occurring at the 
powers of two in the OS or networking stack, we adopt the Java Grande Forum’s 
convention of using values that are similar to the powers of two. The vertical 
axis, labeled Mbits/s, shows bandwidth calculated as the total number of bits 
exchanged between a pair of nodes, divided by the total time for the send and 
receive operations. We only report results for the node that initiates the send 
first, followed by the receive, to ensure timing the entire round-trip transit time. 
Thus, the maximum bandwidth for this benchmark is 100 Mbps, or half of the 
hardware maximum. We report the median of five runs because a single slow 
outlier can impact the mean value by a significant amount, especially for small 
message sizes where the overall transmission time is dominated by latency. 
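
The two calculations used in this benchmark, bandwidth and the median of repeated runs, can be sketched as follows. The names are hypothetical. The sketch assumes that the payload is counted once and divided by the full round-trip time, an interpretation that is consistent with the 100 Mbps ceiling stated above but is not spelled out in detail.

```java
import java.util.Arrays;

// Bandwidth and median helpers for a ping-pong style measurement.
final class PingPongStats {
    /**
     * Mbits/s for a message of the given number of doubles, assuming the
     * payload is counted once and the time covers the send and the receive.
     */
    static double mbitsPerSecond(long doublesPerMessage, double roundTripSeconds) {
        double payloadBits = doublesPerMessage * (double) Double.SIZE;  // 64 bits per double
        return payloadBits / roundTripSeconds / 1.0e6;
    }

    /** Median of the samples; unlike the mean, it ignores a single slow outlier. */
    static double median(double[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return (sorted.length % 2 == 1)
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static void main(String[] args) {
        double[] runs = {5.0, 1.0, 9.0, 3.0, 100.0};  // made-up timings with one outlier
        System.out.println("median = " + median(runs));
        System.out.println("Mbits/s = " + mbitsPerSecond(1_000_000L, 0.64));
    }
}
```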

We used two different java.io implementations: java.io (doubles), which 
performs the necessary conversions from doubles to bytes and vice versa, and 
java.io (bytes), which sends an equivalent amount of data between byte 
arrays without conversions. The java.io (doubles) implementation highlights the 
tremendous overhead imposed by conversions under the old I/O model, while 
the results for java.io (bytes) represent an upper bound for performance of the 
old Java I/O model. 

Fig. 1. Ping-Pong performance for MPJava, LAM-MPI, mpiJava, and java.io. Note 
that java.io (doubles) performs conversions between doubles and bytes, while java.io 
(bytes) does not. (Plot title: PingPong between two nodes; x axis: doubles swapped; 
y axis: Mbits/s.) 



It is not surprising that our java.nio-enabled MPJava framework outperforms 
the java.io (doubles) implementation because the conversions are extremely 
inefficient. However, MPJava also outperforms the java.io (bytes) implementation 
for data sizes larger than about 2000 doubles. We surmise that this 
is due to inefficiencies in java.io’s buffering of data. Although both 
implementations need to copy data from the Java heap into the C heap, MPJava needs to 
copy data from a Java array into a pre-allocated direct buffer that does not need 
to be cleaned up, while the java.io (bytes) implementation needs to allocate 
and then clean-up space in the C heap. This may be an expensive operation on 
some JVMs. 

The native LAM-MPI implementation provides better performance than MPJava 
for message sizes up to about 1000 doubles, while MPJava provides superior 
performance for sizes larger than 7000 doubles. 

The main contribution of this particular experiment is the empirical evidence 
we provide that Java is capable of delivering sustained data transfer rates 
competitive with available MPI implementations of this common microbenchmark. 

4.2 All-to-All 

The next microbenchmark we implemented was an all-to-all bandwidth 
utilization microbenchmark based on the Java Grande Forum’s 
JGFAlltoAllBench.java. 

The all-to-all microbenchmark measures bandwidth utilization in a more 
realistic manner than ping-pong. An all-to-all communication is necessary when 
a vector shared between many nodes needs to be distributed, with each node 
sending its portion to every other node. Thus, if there are n nodes and the vector 
has v total elements, each node must communicate its v/n elements to n - 1 peers. 

Fig. 2. All-To-All performance for MPJava, prefix algorithm 
(x axis: doubles exchanged between each pair of nodes) 

Fig. 3. All-To-All performance for LAM-MPI 
(x axis: doubles exchanged between each pair of nodes) 

Figure 2 represents the results of our framework using the parallel prefix 
algorithm, while Figure 4 shows the results for the concurrent algorithm. Figure 3 
illustrates the performance of the same microbenchmark application written in 
C using the LAM-MPI library, and Figure 5 shows the results for mpiJava with 
bindings to mpich. Note that we do not chart mpiJava’s performance for 32 
nodes because the performance achieved was under 1 Mbps. 

The values on the X-axis represent the number of doubles exchanged between 
each pair of nodes. A value v on the X-axis means that each node transmitted a 
total of v * (n - 1) doubles, where n is the number of nodes used. The actual values selected 
for the X-axis are the same as those used in the ping-pong microbenchmark 
previously, for the same reason. 

The Y-axis charts the performance in megabits/s (Mbps). We chose the median 
value of many runs because a single slow outlier can negatively impact the 
mean value, especially for small message sizes where overall runtimes are 
dominated by latency. Thus, the “dips” and other irregularities are repeatable. Note 
that Figures 2, 3 and 4 have the same scale on the Y-axis, and the theoretical 
hardware maximum for this experiment is 200 Mbps. 

Fig. 4. All-To-All performance for MPJava, concurrent algorithm 
(plot title: MPJava All-to-all Benchmark (concurrent algorithm)) 

Fig. 5. All-To-All performance for mpiJava 
(x axis: doubles exchanged between each pair of nodes) 

The MPJava concurrent broadcast algorithm occasionally outperforms the 
parallel prefix algorithm; however, the performance of the concurrent algorithm 
is not consistent enough to be useful. We believe this is due at least in part to 
sub-optimal thread scheduling in the OS and/or JVM. In addition, we were not 
able to achieve true concurrency for this experiment because the machines we 
used for our experiments have only 1 CPU. 

MPJava’s parallel prefix algorithm outperformed the LAM-MPI implemen- 
tation for large message sizes. We ascribe these differences to the difference in 
the broadcast algorithms. Parallel prefix has a predictable send/receive schedule, 
while LAM-MPI uses a naive all-to-all algorithm that exchanges data between 
each pair of nodes. 

The comparison with mpiJava is somewhat unfair because MPICH, the underlying 
native MPI library for mpiJava, gave substantially worse performance 
than LAM-MPI. However, the comparison does provide evidence of some of the 
performance hurdles that must be overcome for Java to gain acceptance as a vi- 
able platform for clustered scientific codes. 

While it is possible that a C-based MPI implementation could use a more 
sophisticated broadcast strategy that outperforms our current implementation, 
there is no reason why that strategy could not be incorporated into a java.nio 
implementation that would achieve similar performance. 

4.3 CG 

Our final performance results are for the NAS PB Conjugate Gradient (CG) 
benchmark [9]. The CG benchmark provides a more realistic evaluation of the 
suitability of Java for high performance scientific computation because it con- 
tains significant floating point arithmetic. 

The CG algorithm uses the inverse power method to find an estimate of the 
largest eigenvalue of a symmetric positive definite sparse matrix with a random 
pattern of nonzero values. 

The kernel of the CG algorithm consists of a multiplication of the sparse 
matrix A with a vector p followed by two reductions of a double, then a broadcast 
of the vector p before the next iteration. These four core operations comprise 
over 80% of the runtime of the calculation. This kernel iterates 25 times, and is 
called by the CG benchmark 75 times to approximate a solution with the desired 
precision. 

We have evaluated the CG benchmark for the Class B and Class C sizes. 

Class   rows of A   total nonzeroes in A   avg. nonzeroes/row 
B       75,000      13,708,072             183 
C       150,000     36,121,058             241 

The data used by the CG benchmark is stored in Compressed Row Storage
(CRS) format. The naive way to parallelize this algorithm is to divide the m
rows of the matrix A among n nodes when performing the A·p matrix-vector
multiplication, then use an all-to-all broadcast to recreate the entire p vector
on each node. We implemented this approach in Fortran with MPI and also in
MPJava, and provide results for this approach in Figure 6.
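The row-wise scheme can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: the helper names (`to_crs`, `crs_matvec_rows`, `naive_parallel_matvec`) are hypothetical, the "nodes" are simulated by a loop, and the all-to-all broadcast is modeled as a plain concatenation of each node's piece.

```python
def to_crs(dense):
    """Convert a dense matrix to (values, col_idx, row_ptr) in CRS form."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, a in enumerate(row):
            if a != 0:
                values.append(a)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def crs_matvec_rows(crs, p, first_row, last_row):
    """Multiply rows first_row .. last_row-1 of the CRS matrix by the full vector p."""
    values, col_idx, row_ptr = crs
    out = []
    for i in range(first_row, last_row):
        total = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            total += values[k] * p[col_idx[k]]   # indirection through col_idx
        out.append(total)
    return out


def naive_parallel_matvec(crs, p, n_nodes):
    """Simulate the row-wise scheme: each node multiplies its block of rows."""
    m = len(crs[2]) - 1
    bounds = [m * k // n_nodes for k in range(n_nodes + 1)]
    pieces = [crs_matvec_rows(crs, p, bounds[k], bounds[k + 1])
              for k in range(n_nodes)]
    # Stand-in for the all-to-all broadcast: every node ends up with all pieces.
    return [x for piece in pieces for x in piece]
```

The `p[col_idx[k]]` access is the extra layer of indirection that the CRS format imposes on the inner loop.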

Because g77 does not always adequately optimize code, we also ran the NAS 
CG benchmark using pgf90, the Portland Group's optimizing Fortran
compiler. The performance was nearly identical to the g77 results. It is likely 
that even a sophisticated compiler cannot optimize in the face of the extra layer 
of indirection required by the CRS storage format for the sparse matrix A. 

Fig. 6. Conjugate Gradient, Class B: MPJava (mpj), Simple Fortran (for). Note
that for each pair of stacked bar charts, MPJava is the leftmost, simple Fortran is the
rightmost

[Figure 7 panels: Conjugate Gradient, Two-Dimensional algorithm, Class B;
Conjugate Gradient, Two-Dimensional algorithm, Class C]

Fig. 7. Conjugate Gradient, Class B: MPJava (mpj), Original NAS Fortran (for).
Note that for each pair of stacked bar charts, MPJava is the leftmost, NAS Fortran is
the rightmost

The NAS PB implementation of CG performs a clever two-dimensional
decomposition of the sparse matrix A that replaces the all-to-all broadcasts with
reductions across rows of the decomposed matrix. The resulting communication
pattern can be implemented with only send() and recv() primitives, and is
more efficient than using collective communications. We implemented the more
sophisticated decomposition algorithm used by the NAS CG implementation in
MPJava, and provide results in Figure 7.
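The idea of replacing the all-to-all broadcast with row reductions can be sketched as a single-process simulation. This does not reproduce the exact NAS data layout; it assumes a simple `grid_rows` by `grid_cols` block decomposition of a dense matrix, and the names `block_bounds` and `matvec_2d` are hypothetical.

```python
def block_bounds(n, parts):
    """Split range(n) into `parts` contiguous blocks; returns the block boundaries."""
    return [n * k // parts for k in range(parts + 1)]


def matvec_2d(A, p, grid_rows, grid_cols):
    """Simulate A·p on a grid_rows x grid_cols process grid.

    Process (i, j) owns one block of A and only segment j of p. It computes a
    partial result for its rows; a sum-reduction across process row i then
    yields segment i of the product.
    """
    rb = block_bounds(len(A), grid_rows)
    cb = block_bounds(len(p), grid_cols)
    q = []
    for i in range(grid_rows):
        partials = []
        for j in range(grid_cols):
            # Local work of process (i, j): block times its segment of p.
            partial = [
                sum(A[r][c] * p[c] for c in range(cb[j], cb[j + 1]))
                for r in range(rb[i], rb[i + 1])
            ]
            partials.append(partial)
        # Reduction across process row i (element-wise sum of the partials).
        q.extend(sum(vals) for vals in zip(*partials))
    return q
```

Each process touches only its own segment of p, so no process needs the whole vector before multiplying.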

We instrumented the codes to time the three major contributors to the run- 
time of the computation: the multiplication of the sparse matrix A with the 
vector p, the all-to-all broadcast of the vector p, and the two reductions required 
in the inner loop of the CG kernel. All four versions of the code perform the 
same number of floating point operations. We report results for four versions: 
naive Fortran (Fortran), the naive MPJava (MPJava), Fortran with the 2D de- 
composition (Fortran 2D), and MPJava with the 2D decomposition (MPJava 
2D). These results are in Table 1 for the Class B problem size, and Table 2 for 
the Class C problem size. 

Table 1. Raw results for Class B Conjugate Gradient benchmark. All times reported
are in seconds

language    nodes   total runtime    A·p   broadcast   reductions   other
Fortran         2            1006    897          60            6      43
MPJ             2            1052    944          44            2      61
Fortran 2D      2             892    754          87            1      50
MPJ 2D          2             907    776          74            2      56
Fortran         4             561    452          80            4      26
MPJ             4             574    450          64           26      34
Fortran 2D      4             519    376          68           40      34
MPJ 2D          4             556    392          76           52      37
Fortran         8             337    216          99            5      17
MPJ             8             351    232          76           21      21
Fortran 2D      8             261    144          73           27      17
MPJ 2D          8             289    183          52           39      15
Fortran        16             268    113         132            8      15
MPJ            16             228    119          83           11      15
Fortran 2D     16             184     77          74           20      13
MPJ 2D         16             190     87          59           35      10
Fortran        32             361     56         273            6      25
MPJ            32             203     60          90           36      17
Fortran 2D     32             107     34          48           18       7
MPJ 2D         32             122     46          42           29       5

Table 2. Raw results for Class C Conjugate Gradient benchmark. All times reported
are in seconds

language    nodes   total runtime    A·p   broadcast   reductions   other
Fortran         2            2849   2601         124           12     112
MPJ             2            2827   2587          86            6     149
Fortran 2D      2            2622   2346         140            2     134
MPJ 2D          2            2684   2430         107            2     145
Fortran         4            1532   1285         160           24      63
MPJ             4            1558   1339         123           12      85
Fortran 2D      4            1482   1178         142           76      85
MPJ 2D          4            1534   1256          98           86      94
Fortran         8             881    648         183            8      41
MPJ             8             879    664         143           22      51
Fortran 2D      8             732    504         128           57      43
MPJ 2D          8             774    571         102           53      47
Fortran        16             602    322         238           10      32
MPJ            16             531    322         156           20      32
Fortran 2D     16             482    253         128           70      30
MPJ 2D         16             459    274          97           57      31
Fortran        32             846    157         623            8      58
MPJ            32             422    159         177           56      31
Fortran 2D     32             260     99         109           37      16
MPJ 2D         32             264    132          77           41      14

The results of the naive algorithm presented in Figure 6 show that MPJava is
capable of delivering performance that is very competitive with popular,
freely-available, widely-deployed Fortran and MPI technology. The poor performance
observable at 32 nodes for the Fortran code reflects the fact that LAM-MPI's
all-to-all collective communication primitive does not scale well. These results
highlight the importance of choosing the appropriate collective communication
algorithm for the characteristics of the codes being executed and the hardware
configuration employed.
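The scaling problem can be read directly from the 32-node rows of Tables 1 and 2 for the naive versions. The short calculation below uses only those table entries; the variable and function names are illustrative.

```python
# (total runtime, broadcast time) in seconds at 32 nodes, naive versions,
# copied from Tables 1 (Class B) and 2 (Class C).
rows = {
    ("B", "Fortran"): (361, 273),
    ("B", "MPJ"): (203, 90),
    ("C", "Fortran"): (846, 623),
    ("C", "MPJ"): (422, 177),
}


def broadcast_share(total, broadcast):
    """Fraction of the total runtime spent in the all-to-all broadcast."""
    return broadcast / total


# Percentage of runtime spent broadcasting, rounded to the nearest integer.
shares = {key: round(100 * broadcast_share(*val)) for key, val in rows.items()}
```

With these inputs the broadcast accounts for roughly three quarters of the naive Fortran runtime at 32 nodes, against well under half for the naive MPJava version.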
The results of the 2D decomposition algorithm presented in Figure 7 also
show MPJava to be competitive with Fortran and MPI. Although the MPJava 
performance is slightly worse, it is within 10% of the Fortran/MPI results. Popu- 
lar wisdom suggests that Java performs at least a factor of 2 slower than Fortran. 
While there is much work left to do in the field of high-performance Java com- 
puting, we hope that our results help bolster Java’s case as a viable platform for 
scientific computing. 

The results of this benchmark suggest that MPJava is capable of delivering 
performance comparable to or in excess of the performance achievable by native 
MPI/C applications. 

In addition, this benchmark provides promising results for the current state of
Java Virtual Machine (JVM) technologies. The results of the A·p sparse
matrix-vector multiplications are nearly identical between the simple Fortran and simple
MPJava versions, and MPJava performs within 0% of Fortran for 2D versions.
The only optimization we performed on the A·p sparse matrix-vector
multiplication code was unrolling the loop by a factor of 8, which accounted for an
improvement of about 17% for the Simple MPJava implementation. We assume
that the Fortran compilers already perform this optimization, as loop unrolling
by hand had no effect on Fortran code compiled with either g77 or pgf90.
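The unrolling transformation can be illustrated on the inner loop of a CRS row product. This is a sketch of the transformation only: the function names are hypothetical, the CRS arrays are assumed inputs, and Python is used purely for illustration, so the sketch says nothing about the speedup measured in the Java code.

```python
def row_dot(values, col_idx, p, start, end):
    """Reference inner loop for one CRS row: entries start .. end-1."""
    total = 0.0
    for k in range(start, end):
        total += values[k] * p[col_idx[k]]
    return total


def row_dot_unrolled8(values, col_idx, p, start, end):
    """Same computation with the loop body unrolled by a factor of 8."""
    total = 0.0
    k = start
    limit = start + ((end - start) // 8) * 8
    while k < limit:
        total += (values[k] * p[col_idx[k]]
                  + values[k + 1] * p[col_idx[k + 1]]
                  + values[k + 2] * p[col_idx[k + 2]]
                  + values[k + 3] * p[col_idx[k + 3]]
                  + values[k + 4] * p[col_idx[k + 4]]
                  + values[k + 5] * p[col_idx[k + 5]]
                  + values[k + 6] * p[col_idx[k + 6]]
                  + values[k + 7] * p[col_idx[k + 7]])
        k += 8
    while k < end:                     # remainder loop for the last < 8 entries
        total += values[k] * p[col_idx[k]]
        k += 1
    return total
```

The unrolled version changes the order of the floating-point additions, so for general inputs the two results can differ in the last bits.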



5 Related Work 

There is a large body of work dealing with message-passing in Java. Previous 
approaches can be loosely divided into two categories: Java/native hybrids, and 
java.io approaches.

JavaMPI [3] and mpiJava [2] are two efforts to provide native method wrap- 
pers to existing MPI libraries. The resulting programming style of JavaMPI is 
more complicated, and mpiJava is generally better supported. Both approaches 
provide the Java programmer access to the complete functionality of a well- 
supported MPI library such as MPICH [10]. 

This hybrid approach, while simple, does have a number of limitations. First, 
mpiJava relies on proper installation of an additional library. Next, the overhead 
of the Java Native Interface (JNI) imposes a performance penalty on native code 
which will likely make the performance of an application worse than if it were 
directly implemented in C with MPI bindings. Furthermore, the JIT compiler 
must make maximally conservative assumptions in the presence of native code 
and may miss potential optimizations. 

Most java.io implementations are based on the proposed MPJ standard of
Carpenter et al. [11]. However, there is no official set of MPI bindings for Java, so
each implementation will have its own particular advantages and disadvantages. 

MPIJ, part of the Distributed Object Groups Metacomputing Architecture 
(DOGMA) project at BYU [4], is a pure-Java implementation of a large subset 
of MPI features. Their implementation is based on the proposed MPI bindings of 
Carpenter et al. [11]. The MPIJ codebase was not available for public download
at the time of publication. Steve Morin provides an excellent overview of MPIJ's
design [12]. We were unable to find any published results of the performance
of MPIJ.

The Manta project [13] supports several interesting flavors of message-passing 
codes in Java, including Collective Communication Java (CCJ) [14], Group 
Method Invocation (GMI) [15], and Ibis [16]. CCJ is an RMI-based collective
communication library written entirely in Java. It provides many features, but 
does not provide the all-to-all broadcast necessary for many scientific codes such 
as Conjugate Gradient. GMI is a generalization of Java RMI in which meth- 
ods can be invoked on a single object or on a group of objects, and results can 
be discarded, returned normally, or combined into a single result. This work is 
extremely interesting from a high-level programmatic perspective, as its fully 
orthogonal group-based design potentially allows programs that break from the 
SPMD model so dominant in MPI codes. Ibis harnesses many of the techniques
and infrastructures developed through CCJ and GMI to provide a flexible GRID 
programming environment. 

MPI Soft Tech Inc. announced a commercial endeavor called JMPI, an ef- 
fort to provide MPI functionality in Java. However, they have yet to deliver 
a product; all we have are their design goals [17]. 

JCluster [18] is a message-passing library that provides PVM and MPI-like 
functionality in Java. The library uses threads and UDP for improved perfor- 
mance, but does not utilize java.nio. The communications are thus subject to 
the inefficiencies of the older java.io package. At the time of publication, an 
alpha version of the library was available for Windows but did not work properly 
under Linux. 

JPVM [19] is a port of PVM to Java, with syntactic and semantic modi- 
fications better suited to Java’s capabilities and programming style. The port 
is elegant, full-featured, and provides additional novel features not available to 
PVM implementations in C or Fortran. However, the lackluster performance of 
JPVM, due in large part to the older io libraries, has proved a limiting factor to 
its wide adoption. 

KaRMI [20] presents a native-code mechanism for the serialization of
primitive types in Java. While extremely efficient, native serialization of primitive
types into byte arrays violates type safety, and cannot benefit from java.nio 
SocketChannels. 

Titanium [21] is a dialect of Java that provides new features useful for high- 
performance computation in Java, such as immutable classes, multidimensional 
arrays, and zone-based memory management. Titanium’s backend produces C 
code with MPI calls. Therefore its performance is unlikely to exceed that of native
MPI/C applications, and could be substantially worse.

Al-Jaroodi et al. provide a very useful overview of the state of distributed
Java endeavors [22].

Much work has been done on GRID computing. Our work does not directly 
deal with issues important to the GRID environment, such as adaptive dynamic 
scheduling or automatic parallelism. Rather, we focus on developing an efficient
set of communication primitives on top of which any GRID-aware library can be
built.

6 Conclusion 

We have built a pure Java message-passing framework using NIO. We demon- 
strate that a message passing framework that harnesses the high-performance 
communication capabilities of NIO can deliver performance competitive with 
native MPI codes. 

We also provide empirical evidence that current Java virtual machines can 
produce code competitive with static Fortran compilers for scientific applications 
rich in floating point arithmetic. 

7 Future Work 

Though MPI supports asynchronous messages, it typically does so without the 
benefit of threads, and in a cumbersome way for the programmer. We have 
a modified version of our framework that provides the abstraction of asyn- 
chronous pipes. This is accomplished through separate send and receive threads 
that make callbacks to user-defined functions. We would like to evaluate the 
performance of our asynchronous message-passing framework for problems that 
do not easily fit into an SPMD model, such as distributed work-stealing and 
work-sharing. 

Clusters are increasingly composed of interconnected SMPs. It is typically 
not useful to schedule multiple MPI tasks on an SMP node, as the additional pro- 
cesses will fight over shared resources such as bandwidth and memory. A Java 
framework that supports the interleaving of computation and communication 
through send, receive and compute threads can better utilize extra processors 
because the JVM is free to schedule its threads on all available processors. We 
have developed a threaded version of our MPJava framework that maintains 
a send, receive and computation thread. In the CG algorithm, since each node 
only needs the entire p vector for the A.p portion of any iteration, and the 
broadcast and matrix-vector multiply step are both significant contributors to 
the total runtime, we use threads to interleave the communication and computa- 
tion of these steps. Our preliminary results were worse than the single-threaded 
results, most likely due to poor scheduling of threads by the OS and the JVM. 
The notion of interleaving computation and communication, especially on an 
SMP, is still very appealing, and requires more study. Multi-threading is an area 
where a pure-Java framework can offer substantial advantages over MPI-based 
codes, as many MPI implementations are not fully thread-safe. 
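The interleaving of communication and computation can be sketched with a receive thread that hands over segments of p as they arrive, while the compute thread multiplies the matching columns immediately instead of waiting for the whole vector. This is an illustrative sketch only, not the MPJava design: the names `receiver` and `overlapped_matvec` are hypothetical, the "network" is a list of segments, and Python threads do not run computation in parallel the way JVM threads can.

```python
import queue
import threading


def receiver(segments, inbox):
    """Stand-in for a receive thread: delivers (offset, segment) pairs of p."""
    for offset, segment in segments:
        inbox.put((offset, segment))
    inbox.put(None)                    # sentinel: no more segments


def overlapped_matvec(local_rows, segments):
    """Multiply local rows by p, consuming segments of p as they arrive."""
    inbox = queue.Queue()
    thread = threading.Thread(target=receiver, args=(segments, inbox))
    thread.start()
    result = [0.0] * len(local_rows)
    while True:
        item = inbox.get()             # blocks until the next segment is ready
        if item is None:
            break
        offset, segment = item
        # Compute with this segment while later segments are still in flight.
        for i, row in enumerate(local_rows):
            for j, x in enumerate(segment):
                result[i] += row[offset + j] * x
    thread.join()
    return result
```

The segments may arrive in any order, since each one carries its own offset into p.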

Although interest in an MPI-like Java library was high several years ago, 
interest seems to have waned, perhaps due to the horrible performance reported 
for previous implementations. Now that NIO enables high-performance
communication, it is time to reassess the interest in MPI-Java.



338 William Pugh and Jaime Spacco 



Finally, we would like to investigate the high-level question of whether a
high-performance message-passing framework in Java should target MPI, or should
adhere to its own standard.
Abstract. The simplest semantics for parallel shared memory programs
is sequential consistency in which memory operations appear to take 
place in the order specified by the program. But many compiler opti- 
mizations and hardware features explicitly reorder memory operations 
or make use of overlapping memory operations which may violate this 
constraint. To ensure sequential consistency while allowing for these opti- 
mizations, traditional data dependence analysis is augmented with a par- 
allel analysis called cycle detection. In this paper, we present new algo- 
rithms to enforce sequential consistency for the special case of the Single 
Program Multiple Data (SPMD) model of parallelism. First, we present 
an algorithm for the basic cycle detection problem, which lowers the running
time from $O(n^3)$ to $O(n^2)$. Next, we present three polynomial-time
methods that more accurately support programs with array accesses. 
These results are a step toward making sequentially consistent shared 
memory programming a practical model across a wide range of languages 
and hardware platforms. 



1 Introduction 

In a uniprocessor environment, compiler and hardware transformations must ad- 
here to a simple data dependency constraint: the orders of all pairs of conflicting 
accesses (accesses to the same memory location, with at least one a write) must 
be preserved. The execution model for parallel programs is considerably more 
complicated, since each thread executes its own portion of the program asyn- 
chronously, and there is no predetermined ordering among accesses issued by 
different threads to shared memory locations. A memory consistency model de- 
fines the memory semantics and restricts the possible execution orders of memory 
operations. Of the various memory models that have been proposed, the most 
intuitive is sequential consistency, which states that a parallel execution must be- 
have as if it is an interleaving of the serial executions by individual threads, with 
each execution sequence preserving the program order [ ] . Sequential consistency 
is a natural extension of the uniprocessor execution model and is violated when
the reordering operations performed by one thread can be observed by another
thread, and thus potentially visible to the user. Figure 1 shows a violation of
sequential consistency due to reordering of memory operations. Although there
are no dependencies between the two write operations in one thread or the two
read operations in the other, if either pair is reordered, a surprising behavior
may result, which does not satisfy sequential consistency.

[Figure 1 panels: Parallel Program; Two Illegal Executions (Reordering on T1,
Reordering on T2). By program order, T2 cannot observe (x, y) = (0, 1);
x = 0, y = 1 for T2 violates SC.]

Fig. 1. Violation of Sequential Consistency: The actual execution may produce results
that would not happen if execution follows program order
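The claim can be checked by brute force on a small example. The sketch below assumes the program of Figure 1 is T1: write x, then write y, and T2: read y, then read x, with x = y = 0 initially; this reading of the figure is an assumption, as are the helper names. It enumerates every interleaving that respects program order and collects the (x, y) pairs T2 can observe.

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def outcomes(t1, t2):
    """Set of (x, y) values read by T2 over all interleavings of t1 and t2."""
    seen = set()
    for schedule in interleavings(t1, t2):
        mem = {"x": 0, "y": 0}
        read = {}
        for op, var in schedule:
            if op == "w":
                mem[var] = 1
            else:
                read[var] = mem[var]
        seen.add((read["x"], read["y"]))
    return seen


T1 = [("w", "x"), ("w", "y")]          # assumed program of thread 1
T2 = [("r", "y"), ("r", "x")]          # assumed program of thread 2

sc_outcomes = outcomes(T1, T2)         # program order kept on both threads
reordered_t1 = outcomes(T1[::-1], T2)  # the two writes swapped
reordered_t2 = outcomes(T1, T2[::-1])  # the two reads swapped
```

Under program order the pair (0, 1) never appears, while swapping either the writes or the reads makes it observable.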

Despite its advantage in making parallel programs easier to understand, se- 
quential consistency can be expensive to enforce. A naive implementation would 
forbid any reordering of shared memory operations by both restricting compile- 
time optimizations and inserting a memory fence between every consecutive pair 
of shared memory accesses from a given thread. The fence instructions are often 
expensive, and the optimization restrictions may prevent code motion, prefetch- 
ing, and pipelining [2]. Rather than restricting reordering between all pairs of 
accesses, a more practical approach computes a subset that is sufficient to ensure 
sequential consistency. This set is called a delay set , because the second access 
will be delayed until the first has completed. Several researchers have proposed 
algorithms for finding a minimal delay set , which is the set of pairs of mem- 
ory accesses whose order must be preserved in order to guarantee sequential 
consistency [3, 4, 5]. 

The problem of computing delay sets is relevant to any programming model 
that is explicitly parallel and allows processors to access shared variables, in- 
cluding serial languages extended with a thread library and languages like Java 
with a built-in notion of threads. It is especially relevant to global address space 
languages like UPC [6], Titanium [7], and Co- Array Fortran [8], which are de- 
signed to run on machines with physically distributed memory, but allow one 
processor to read and write the remote memory on another processor. For these 
languages, the equivalent of a memory barrier thus may be a round-trip event. 
In this paper, we focus on efficient algorithms to compute the delay sets for
various types of Single Program Multiple Data (SPMD) programs. For example, 
given the sample code in Figure 1, the analysis would determine that neither 
pair of accesses can be reordered without violating sequential consistency. Our 
analysis framework is based on the cycle detection problem first described by 
Shasha and Snir [3]; previous work [9] showed that such analysis for SPMD 
programs can be performed in polynomial time. In this paper we substantially 
improve both the speed and the accuracy of the SPMD cycle detection algorithm 
described in [9]. By utilizing the concept of strongly connected components, we 
improve the running time of the analysis asymptotically from $O(n^3)$ to $O(n^2)$,
where n is the number of shared memory accesses in the program. We then 
present three methods that extend SPMD cycle detection to handle programs 
with array accesses by incorporating into our analysis data dependence
information from array indices. All three methods significantly improve the accuracy of
the analysis for programs with loops; they differ in relative precision,
applicability, and speed, so developers can choose among them according to
these tradeoffs.

The rest of the paper is organized as follows. We formally define the problem 
in Section 2 and summarize the earlier work on it in Section 3. Section 4
describes our improvements to the analysis' running time, while Section 5 presents
extensions to the cycle detection analysis that significantly improve the quality 
of the results for programs with array accesses. Section 6 concludes the paper. 

2 Problem Formulation 

Our analysis is designed for shared memory (or global address space) programs 
with an SPMD model of parallelism. An SPMD program is specified by a sin- 
gle program text, which defines an individual program order for each thread. 
Threads communicate by explicitly issuing reads and writes to shared variables. 
For simplicity, we consider the program to be represented by its control flow 
graph, P. An execution of an SPMD program for n threads is a set of n se- 
quences of operations, each of which is consistent with P. An execution defines 
a partial order, which is the union of those n sequences. 

Definition 1 (Sequential Consistency). An execution is sequentially consis- 
tent if there exists a total order consistent with the execution’s partial order , 
such that the total order is a correct serial execution. 

We are interested only in the behavior of the shared memory operations, and 
thus restrict our attention to the subgraphs containing only such operations. In 
general, parallel hardware and conventional compilers will allow memory opera- 
tions to execute out of order as long as they preserve the program dependencies. 
We model this by relaxing the program orders for each thread, and instead use 
a subset of P called the delay set , D. 

Definition 2 (Sufficient Delay Set). Given a program graph P and a sub- 
graph D, D is a sufficient delay set if all executions of D are equivalent to 
a sequentially consistent execution of P. 
All executions must now observe only the program dependencies within each
thread and the orderings given in D. Intuitively, the delay set contains pairs of 
memory operations that execute in order. They are implemented by preventing 
program transformations that would lead to reordering and by inserting memory 
fences during code generation to ensure that the hardware preserves the order. 
A naive algorithm will take D to be the entire program ordering P, forcing 
compilers and hardware to strictly follow program order. A delay set is considered 
minimal if no strict subset is sufficient. We are now ready to state the problem 
in its most general form: 

Given a program graph P for an SPMD parallel program, find 
a sufficient minimal delay set D for P. 

3 Background 

3.1 Related Work 

Shasha and Snir [3] pioneered the study of correct execution of explicitly parallel 
programs and characterized the minimal set of delays required to preserve
sequential consistency. Their results are for an arbitrary set of parallel threads (not
necessarily an SPMD program), but do not address programs with branches,
aliases or array accesses. Midkiff and Padua [2] further demonstrated that the 
delay set computation is necessary for performing a large number of standard 
compile-time optimizations. They also extended Shasha and Snir’s characteriza- 
tion to work for programs with array accesses, but did not provide a polynomial- 
time algorithm for performing the analysis. 

Krishnamurthy and Yelick [4, 9] later showed that Shasha and Snir's
framework for computing the delay set results in an intractable NP-hard problem for
MIMD programs and proposed a polynomial-time algorithm for analyzing SPMD 
programs. They also improved the accuracy of the analysis by treating synchro- 
nization operations as special accesses whose semantics is known to the compiler. 
They also demonstrated that the analysis enables a number of techniques for op- 
timizing communication, such as message pipelining and prefetching. 

Once the delay set has been computed, sequential consistency can be en- 
forced by inserting memory barriers into the program to satisfy the delays. Lee 
and Padua [10] presented a compiler technique that reduces the number of fence 
instructions for a given delay set, by exploiting the properties of fence and syn- 
chronization operations. Their work is complementary to ours, as it assumes the 
delay set is already available, while we focus on the earlier problem of computing 
the minimal set itself. 

Recent studies have focused on data structures for correct and efficient appli- 
cation of standard compile-time optimizations for explicitly parallel programs. 
Lee et al. [5] introduced a concurrent CFG representation for summarizing con- 
trol flow of parallel code, and a concurrent SSA form that encodes sequential 
data flow information as well as information about possibly conflicting accesses 
from concurrent threads. They also showed how several classical analyses and 
optimizations can be extended to work on the CSSA form to optimize parallel
code without violating sequential consistency. Knoop and Steffen [11] showed 
that unidirectional bitvector analyses can be performed for parallel programs to 
enable optimizations such as code motion and dead code elimination without 
violating sequential consistency. 

3.2 Cycle Detection

Analyses in this paper are based on Shasha and Snir’s [3] cycle detection algo- 
rithm, which we briefly describe here. All violations of sequential consistency 
can be attributed to conflicting accesses: 

Definition 3 (Conflicting Accesses). Two shared memory operations u,v 
from different threads are said to conflict if they access the same memory loca- 
tion, and at least one of them is a write. 

Conflicting accesses are the mechanism by which parallel threads communi- 
cate, and also the means by which one thread can observe memory operations 
reordered by another. The program order P defines a partial order on indi- 
vidual threads’ memory accesses, but does not impose any restrictions on how 
operations from different threads should be interleaved, so there is not a sin- 
gle program behavior against which we can define correct reorderings. Instead, 
a happens-before relation for shared memory accesses originating from different 
threads is defined at runtime based on the time of their occurrences to fully cap- 
ture the essence of a parallel execution. Due to its nondeterministic nature, each 
instance of parallel execution defines a different happens-before relation, which 
may affect execution results depending on how it orders conflicting accesses. 

For a given parallel execution, let $E$ be the partial order on conflicting
accesses that is exhibited at runtime, which is determined by the values returned
by reads from writes. The graph given by $P \cup E$ captures all information
necessary to reproduce the results of a parallel execution: $P$ orders accesses on the
same thread, while $E$ orders accesses from different threads to the same memory
location. If there is a violation of sequential consistency, then for two accesses
$u, v$ issued by the same thread, both $(u, v)$ and $(v, u)$ are related in $P \cup E$. Viewed
as a graph, such a situation occurs exactly when $P \cup E$ contains a cycle that
includes $E$ edges.¹ Since we cannot predict at compilation time which access in
a conflicting pair will happen first, we approximate $E$ by $C$, the conflict relation
which is a superset of $E$ and contains all pairs of conflicting accesses. The conflict
relation is irreflexive, symmetric, and not transitive, and can be represented in
a graph as bidirectional edges between two conflicting accesses.
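As a small illustration, the conflict relation can be computed from a list of accesses by testing Definition 3 on every pair. The encoding of an access as a (thread, variable, is_write) triple and the names below are assumptions of this sketch, and the four accesses are the ones assumed for the program of Figure 1.

```python
from itertools import combinations


def conflict_relation(accesses):
    """Return C as a symmetric set of ordered pairs of access names.

    `accesses` maps a name to (thread, variable, is_write). Two accesses
    conflict if they come from different threads, touch the same variable,
    and at least one of them is a write.
    """
    C = set()
    for (u, (tu, xu, wu)), (v, (tv, xv, wv)) in combinations(accesses.items(), 2):
        if tu != tv and xu == xv and (wu or wv):
            C.add((u, v))
            C.add((v, u))              # bidirectional edge
    return C


accesses = {
    "a": (1, "x", True),               # thread 1 writes x
    "b": (1, "y", True),               # thread 1 writes y
    "c": (2, "y", False),              # thread 2 reads y
    "d": (2, "x", False),              # thread 2 reads x
}
C = conflict_relation(accesses)
```

The result contains the two bidirectional edges a-d and b-c and no pair of the form (u, u).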

The goal of Shasha and Snir's analysis is thus to perform cycle detection on
the graph $P \cup C$ of a parallel program. Their algorithm uses the notion of critical
cycle to find the minimal delay set necessary for sequential consistency:

Definition 4 (Critical Cycle). A critical cycle in $P \cup C$ is a simple cycle with
the property that for any two non-adjacent nodes $u, v$ in the cycle, $(u, v) \notin P$.



1 Intrinsic cycles in P due to loops are ruled out. 
[Figure 2: threads P1 and P2, initially x = y = 0; legend: P edges, C edges, delays]



Fig. 2. Computing the Delay Set 



In other words, when detecting cycles we always attempt to find minimal 
cycles, and a critical cycle can have at most two (successive) nodes in any thread. 
Shasha and Snir proved the following theorem [3], which states that the P edges
in the set of critical cycles form a delay set that guarantees sequential consistency:

Theorem 1 (Existence of a Minimal Delay Set for SC). Let $D$ be the
set of edges $(u, v)$ in straight-line code, where $(u, v) \in P$ is part of a critical
cycle. Then any execution order that preserves delay $D$ is sequentially consistent;
furthermore, the set $D$ is minimal.

Figure 2 shows how a critical cycle can be used to compute the minimal delay 
set for sequential consistency, for the sample code from Figure 1. 



3.3 Cycle Detection for SPMD Programs 

Detecting critical cycles for an arbitrary program order P, unfortunately, is NP- 
hard as the running time is exponential in the number of threads. Krishnamurthy 
and Yelick [9] proposed a polynomial time algorithm for the common special 
case of SPMD programs, taking advantage of the fact that all threads execute 
identical code. Their algorithm, explained in detail in [ 2], works as follows: 

Definition 5 (Conflict Graphs for SPMD Programs). Consider $P_l$, $P_r$ to
be two copies of the original program $P$, so that $u_l \in P_l$ and $u_r \in P_r$ if $u \in P$.
Define $C$ to be the set of conflicting accesses, and

$T_1 = \{(u_l, v_r), (v_l, u_r) \mid (u, v) \in C\}$   (1)

$T_2 = \{(u_r, v_r) \mid (u, v) \in C\}$   (2)

$T_3 = \{(u_r, v_r) \mid (u, v) \in P\}$   (3)

$CG = T_1 \cup T_2 \cup T_3$   (4)



Input: $P$ and $C$ of an SPMD program
Output: delay set for $P$

1. Construct CG following the description in Definition 5;
2. For every $u_l \in P_l$, perform a breadth-first search with the vertex as root;
3. Check for every $(u, v) \in P$ whether $u_l$ is reachable starting from $v_l$ in CG,
   using results from step 2. If yes, then $(u, v)$ belongs to the delay set.

Algorithm 1: Krishnamurthy and Yelick's Algorithm for SPMD Cycle Detection

The graph CG, named the conflict graph, will also be used in other analyses
described later in this paper. The right side of the conflict graph $P_r$ is identical
to $P \cup C$, while the left side $P_l$ has no internal edges and connects to the right
side via the conflict edges. Krishnamurthy and Yelick described an algorithm
that computes the delay set by detecting a back-path in the transformed graph
for each P edge $(u_l, v_l)$ and proved the following theorem in [9]:

Theorem 2 (Cycle Detection for SPMD Programs). For an edge $(u, v) \in P$,
if there exists a path from $v_l$ to $u_l$ in CG, then $(u, v)$ belongs to the minimal
delay set. Furthermore, the delay set computed is the same as the one defined in
Theorem 1.

Based on the above theorem, they claimed that cycle detection for SPMD
programs can be performed in $O(n^3)$ time (Algorithm 1), where $n$ is the number
of shared memory accesses in $P$.
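The three steps of Algorithm 1 can be sketched as follows. This is an illustrative reading, not the authors' implementation: nodes of CG are encoded as (name, side) pairs, the conflict edges of $T_1$ are assumed to be traversable in both directions (in line with the bidirectional conflict edges described for C), and the function names are hypothetical.

```python
from collections import deque


def build_conflict_graph(P, C):
    """Adjacency sets of CG (Definition 5); nodes are (name, "l") or (name, "r")."""
    adj = {}

    def add_edge(src, dst):
        adj.setdefault(src, set()).add(dst)
        adj.setdefault(dst, set())

    sym = set(C) | {(v, u) for (u, v) in C}    # make C symmetric
    for u, v in sym:
        add_edge((u, "l"), (v, "r"))           # T1, assumed bidirectional
        add_edge((v, "r"), (u, "l"))
        add_edge((u, "r"), (v, "r"))           # T2: conflicts in the right copy
    for u, v in P:
        add_edge((u, "r"), (v, "r"))           # T3: program order in the right copy
    return adj


def reachable(adj, root):
    """Breadth-first search; returns every node reachable from root."""
    seen = {root}
    frontier = deque([root])
    while frontier:
        node = frontier.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def delay_set_bfs(P, C):
    """(u, v) in P is a delay if u_l is reachable from v_l in CG (Theorem 2)."""
    adj = build_conflict_graph(P, C)
    delays = set()
    for u, v in P:
        if (u, "l") in reachable(adj, (v, "l")):
            delays.add((u, v))
    return delays
```

With P = {(a, b), (c, d)} and conflicts a-d and b-c (the accesses assumed for Figure 1), both program edges end up in the delay set; removing the b-c conflict leaves the delay set empty.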

4 A Faster Algorithm for SPMD Cycle Detection 

In this section, we show a slight modification of Krishnamurthy and Yelick's 
algorithm that can compute the identical delay set in $O(n^2)$ time. Algorithm 1 
is easy to understand but inefficient due to the breadth-first search required for 
each node. Instead, we can improve its running time by using strongly connected 
components (SCC) to avoid the redundant computations performed for each 
node. Note that proofs of the theorems presented in this paper have been omitted 
due to space constraints; interested readers can find them in our technical 
report [13], which contains the full version of the paper. 

Input: P and C of an SPMD program 
Output: delay set for P 

1. Create the graph $P_r$ as it appears in Definition 5, by taking $P \cup C$; 

2. Identify the strongly connected components in $P_r$; 

3. For every node $u \in P$, find the strongly connected component $SCC_u$ that 
u's conflicting accesses belong to. (We will prove that they must all be in the 
same SCC.); 

4. For each $(u, v) \in P$, if there is a path from $SCC_v$ to $SCC_u$ in the 
directed acyclic graph of SCCs, we add $(u, v)$ to the delay set. 

Algorithm 2: An $O(n^2)$ Algorithm for Computing Delay Set 

Our algorithm is similar to the one proposed in [14] in that both rely on the 
concept of strong connectivity; an important distinction, however, is that we 
do not require initialization writes for every variable. If all accesses are 
read-only, step 3 fails due to the absence of conflicts, and no edges will be 
added to the delay set. This difference is vital if we want to combine the 
algorithm with synchronization analysis of barriers, since it is common for SPMD 
variables to be read-only in some phases of the program. Before proving 
Algorithm 2, we first explain the claim in step 3 that all conflicting accesses 
of a node will belong to the same strongly connected component in $P_r$. Consider 
a node $u$ and any two of its conflicting accesses $v_1$, $v_2$. Since there exist 
bidirectional edges between $u$, $v_1$ and $u$, $v_2$ in $T_2$ (the C edges), it is 
clear that they all belong to the same SCC. We can now show that for an SPMD 
program this modified algorithm is equivalent to Algorithm 1, and calculate its 
running time: 

Theorem 3. Algorithm 2 returns the same delay set as Algorithm 1 for any 
SPMD program. 

Proof. The proof can be found in [13]. 

Theorem 4. Algorithm 2 runs in $O(n^2)$ time, where $n$ is the number of shared 
accesses in P. 

Proof. The proof can be found in [13]. 
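
The four steps of Algorithm 2 can be sketched as follows. This is an illustrative sketch under our own assumptions rather than the authors' code: it uses Kosaraju's algorithm for step 2, treats conflict edges as bidirectional, and stores reachability in plain Python sets, so it is not tuned to meet the $O(n^2)$ bound. All function and variable names are hypothetical.

```python
def strongly_connected_components(adj):
    """Kosaraju's algorithm; returns a dict mapping node -> component id."""
    order, seen = [], set()
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(adj[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(adj[nxt])))
                    break
            else:
                order.append(node)    # every successor has been explored
                stack.pop()
    reverse = {node: set() for node in adj}
    for node, successors in adj.items():
        for nxt in successors:
            reverse[nxt].add(node)
    comp = {}
    for start in reversed(order):     # decreasing finish time
        if start in comp:
            continue
        comp[start] = start
        stack = [start]
        while stack:
            node = stack.pop()
            for prev in reverse[node]:
                if prev not in comp:
                    comp[prev] = start
                    stack.append(prev)
    return comp


def delay_set_scc(p_edges, c_pairs):
    # Step 1: P_r is P together with C (conflicts in both directions).
    nodes = {x for edge in list(p_edges) + list(c_pairs) for x in edge}
    adj = {x: set() for x in nodes}
    conflicts = {x: set() for x in nodes}
    for u, v in p_edges:
        adj[u].add(v)
    for u, v in c_pairs:
        adj[u].add(v)
        adj[v].add(u)
        conflicts[u].add(v)
        conflicts[v].add(u)
    # Step 2: strongly connected components of P_r.
    comp = strongly_connected_components(adj)
    # Step 3: SCC of each node's conflicting accesses (absent if none).
    scc_of = {x: comp[next(iter(conflicts[x]))] for x in nodes if conflicts[x]}
    # Step 4: reachability in the DAG of SCCs.
    dag = {c: set() for c in set(comp.values())}
    for x in nodes:
        for y in adj[x]:
            if comp[x] != comp[y]:
                dag[comp[x]].add(comp[y])
    reach = {}

    def reachable(c):
        if c not in reach:
            seen, stack = {c}, [c]
            while stack:
                for nxt in dag[stack.pop()]:
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
            reach[c] = seen
        return reach[c]

    return {(u, v) for u, v in p_edges
            if u in scc_of and v in scc_of
            and scc_of[u] in reachable(scc_of[v])}
```

On a hypothetical program with P edges S1 -> S2 -> S3 -> S4 -> S5 and conflicts (S1, S4) and (S2, S3), statements S1 to S4 collapse into one component and the sketch reports the first three P edges; with no conflicts at all it reports an empty delay set, matching the read-only case discussed above.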



5 Extending SPMD Cycle Detection for Array Accesses 

Another area in which the SPMD cycle detection algorithm can be improved 
is the quality of the delay set for array accesses. Although Theorem 2 states 
that the delay set computed by the algorithm is “minimal” , the claim holds only 
for straight-line code with perfect alias information. The algorithm is therefore 
overly conservative when analyzing array accesses in loops; every P edge inside 
a loop can be included in the delay set, as a back-path can be constructed using 
the loop’s back edge. This has an undesirable effect on performance, as the false 
delays can thwart useful loop optimizations such as loop-invariant code motion 
and software pipelining. 

In this section, we present an analysis framework that extends SPMD cycle 
detection to handle array accesses. After describing an existing approach that 
requires exponential time in Section 5.1, we present three polynomial-time 
algorithms that could significantly reduce the number of delays inside loops. 
While all three techniques collect information from array subscripts to make the 
analysis more precise, they differ in their approaches for representing and 
propagating information: classical graph theory algorithms, data-flow analysis, 
and integer programming methods. The choice among the three methods largely 
depends on the amount of information that can be statically extracted from the 
array subscripts; for instance, the data-flow analysis approach sacrifices some 
precision for a more efficient algorithm, and the integer programming technique 
supports complex affine array expressions at the cost of increased complexity. 

    if (MYTHREAD == 1)
      for (i = 0; i < N; i += 3) {
        A[i] = c1;      /* S1 */
        B[i+1] = c2;    /* S2 */
      }
    if (MYTHREAD == 2)
      for (j = 2; j < N; j += 2) {
        B[j] = c3;      /* S3 */
        A[j-2] = c4;    /* S4 */
      }

    S1 -> S4: i = j - 2
    S1 -> S2: i' = i + 3k1, k1 >= 0
    S2 -> S1: i = i' + 3k1, k1 >= 1
    S2 -> S3: i + 1 = j'
    S3 -> S4: j = j' + 2k2, k2 >= 0
    S4 -> S3: j' = j + 2k2, k2 >= 1

Fig. 3. Conflict Graph with Corresponding Constraints 

For simplicity, we consider nested well-behaved loops in C, where each 
dimension of the loop is of the form for (i = init; cond(i); i += k) { loop_body }, 
with the following provisions: both k and loop_body may be different for each 
thread, the loop index i is not modified in the loop body, and array subscripts 
are affine expressions of loop indices. While the definition may seem restrictive, 
in practice loops in scientific applications with regular access patterns frequently 
exhibit this characteristic. We further assume that the base address of the array 
access is a constant, and that different array variables do not overlap (i.e., arrays 
but not pointers in C). 

5.1 Existing Approach 

Midkiff et al. [15] proposed a technique that extends Shasha and Snir's analysis 
to support array accesses. Under their approach, every edge of the conflict graph 
(named an s-level graph in their work) is associated with a linear constraint that 
relates the array subscripts on its two nodes. A conflict edge generates an equality 
constraint, since the existence of a conflict implies the subscripts must be equal. 
Also, the constraint of each conflict edge will use a fresh variable for the loop 
index, as in general the conflicts can happen in different iterations. The 
constraint for a P edge is only slightly more complicated. Consider 
$(A[f(i)], B[g(i')]) \in P$, where $i$ and $i'$ represent possibly different loop 
index values, and $f$ and $g$ are affine functions. From the definition of P we can 
immediately derive $i' = i + k_1$, where $k_1$ is a multiple of the loop increment, 
since $A[f(i)]$ happens first by program order. The inequality constraint for 
$k_1$ depends on the context of the P edge; we specify $k_1 \ge 1$ if it is the 
back edge, and $k_1 \ge 0$ otherwise. Figure 3 shows the constraints generated by 
each edge in the sample graph. 

    if (MYTHREAD == 1)
      for (i = 1; i < N; i++) {
        A[i] = 1;      /* S1 */
        B[i+1] = 2;    /* S2 */
      }
    if (MYTHREAD == 2)
      for (i = 1; i < N; i++) {
        B[i] = 3;      /* S3 */
        A[i-1] = 4;    /* S4 */
      }

    [Graph: right part of the conflict graph over S1 to S4, with weighted P edges and C edges.]

Fig. 4. Adding Edge Weights for Cycle Detection 



Once the constraints are specified, the next step is to generate all cycles in the 
graph that may correspond to critical cycles (Definition 4). Midkiff et al. showed 
that the problem can be reduced to finding solutions to the linear system formed 
by the constraints of every edge in the cycle; a delay is necessary only for edges 
on a cycle with a consistent linear system. If a cycle satisfies the criteria, a final 
filtering step is applied to see if it can be discarded because it shares common 
solutions with another smaller cycle. 

While their technique successfully incorporates dependence information to 
improve accuracy of the analysis, its applicability appears limited due to two 
factors. First, it does not specify how to generate the cycles in the conflict graph; 
the total number of (simple and non-simple) cycles is exponential in the num- 
ber of nodes, so a brute force method that examines all is clearly not practical. 
Another limitation is the cost of solving each linear system, which is equiva- 
lent to integer linear programming, a well-known NP-complete problem. Since 
a cycle can contain $O(n)$ edges and thus $O(n)$ constraints, solving the system again 
requires exponential time. As a result, in the next section we will present several 
polynomial-time algorithms that make cycle detection practical for programs 
with loops. 



5.2 Polynomial-Time Cycle Detection for Array Accesses 

Our analysis framework combines array dependence information with the conflict 
graph, except that we assign each P edge an integer weight equal to the 
difference between the array subscripts of its two nodes. Scalars can be considered 
as array references with a zero subscript. Also, an optional preprocessing step 
can apply affine memory disambiguation techniques [16] to eliminate conflict 
edges between independent array accesses. Figure 4 illustrates this construction,² 
where the two edges in the loop body receive weights of 1 and -1, and the back 
edges are assigned the values 0 and 2 to reflect both the difference between 
the array subscripts and the increment on the loop index variable after each 
iteration. Conflict edges always have zero weight, as the presence of a conflict 
implies the two array subscripts must be equal. For an edge $(u_l, v_l) \in P$, the 
goal of the analysis is not only to detect a path from $v_l$ to $u_l$ in the conflict 
graph, but also to verify that the back-path together with the edge forms a (not 
necessarily simple) cycle with zero weight: 

Theorem 5 (Cycle Detection with Weighted Edges). With the above 
construction, an edge $(u_l, v_l) \in P$ is in the delay set if it satisfies the 
conditions in Theorem 2, and $W(u_l, v_l) + W(\mathit{backpath}(v_l, u_l)) = 0$, 
where $W(e)$ is the weight of edge $e$. 

Proof. The proof can be found in [13]. 

² We show only the right part of the conflict graph, as the left part remains 
unchanged. 

Input: P and C of an SPMD program 
Output: delay set for P 

Construct CG following the description in Definition 5; 
Annotate each P edge in CG with its corresponding weight; 
foreach $(u, v) \in P$ do 
    Add $(u_l, v_l)$ to $P_l$, with its edge weight; 
    Run the zero cycle detection algorithm from [19] on CG; 
    If $(u_l, v_l)$ is part of a zero cycle, add it to the delay set; 
end 

Algorithm 3: Handling Array Accesses Through Zero Cycle Detection 
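
The weights quoted for Figure 4 can be reproduced with a few lines of arithmetic. The sketch below is only a worked check under our reading of the construction (a body edge weighs the difference of the two subscript offsets, and a back edge additionally accounts for one loop increment); the helper names are hypothetical.

```python
# Subscript offset of each access relative to the loop index i (Figure 4):
# S1 is A[i], S2 is B[i+1], S3 is B[i], S4 is A[i-1].
offset = {"S1": 0, "S2": 1, "S3": 0, "S4": -1}
increment = 1  # both loops in Figure 4 use i++


def body_edge_weight(src, dst):
    """Weight of a P edge inside one iteration: subscript difference."""
    return offset[dst] - offset[src]


def back_edge_weight(src, dst):
    """Weight of a loop back edge: dst runs one increment later than src."""
    return (offset[dst] + increment) - offset[src]


weights = {
    ("S1", "S2"): body_edge_weight("S1", "S2"),
    ("S3", "S4"): body_edge_weight("S3", "S4"),
    ("S2", "S1"): back_edge_weight("S2", "S1"),
    ("S4", "S3"): back_edge_weight("S4", "S3"),
}

# The conflict edges (S2, S3) and (S4, S1) weigh zero, so the cycle
# S1 -> S2 -> S3 -> S4 -> S1 sums only its two P edges.
cycle_weight = weights[("S1", "S2")] + 0 + weights[("S3", "S4")] + 0
```

The body edges come out as 1 and -1 and the back edges as 0 and 2, as stated in the text, and the cycle through all four statements has weight zero, which is why the P edges of Figure 4 require delays.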



Zero Cycle Detection: If all edge weights are compile-time constants, 
Theorem 5 can be reduced to the problem of finding zero cycles in the conflict 
graph. On the surface the reduced problem still seems difficult to solve, as 
finding a simple cycle with zero total weight is known to be NP-complete. For our 
purposes, however, we are interested in finding zero cycles that need not be 
simple, as a zero cycle that visits a node multiple times conveys a delay due to 
conflicts among array accesses in different iterations. Several studies [17, 18] have 
presented recurrence equations and linear programming techniques to solve the 
general form of the ZERO-CYCLE problem, which determines for a graph $G$ 
with $k$-dimensional vector edge weights if it contains a cycle whose weights sum 
to the zero vector. In particular, Cohen and Megiddo [19] proved that zero cycle 
detection for a graph with fixed $k$ can be performed in polynomial time; they 
further showed that the special case of $k = 1$ can be answered in $O(n^3)$ time 
using a modified all pairs shortest path algorithm, where $n$ is the number of 
nodes. Algorithm 3 computes the delay set based on this result. 
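
To make the notion of a (not necessarily simple) zero cycle concrete, the following sketch checks by brute force whether a given edge starts a closed walk of total weight zero. It is emphatically not the polynomial-time algorithm of [19]: it only explores walks of at most `max_len` edges, so a negative answer holds only up to that bound. The function name, the `max_len` parameter, and the node labels (`S1l` for the left copy of S1, `S1r` for the right copy) are our own assumptions.

```python
def on_zero_weight_walk(edges, start_edge, max_len):
    """Bounded search for a zero-weight closed walk through start_edge.

    edges maps (src, dst) -> integer weight. The walk begins with
    start_edge = (u, v) and must return to u within max_len edges.
    """
    succ = {}
    for (src, dst), w in edges.items():
        succ.setdefault(src, []).append((dst, w))
    u, v = start_edge
    first = edges[start_edge]
    if u == v and first == 0:
        return True
    frontier = {(v, first)}           # (current node, weight so far)
    for _ in range(max_len - 1):
        nxt = set()
        for node, total in frontier:
            for dst, w in succ.get(node, ()):
                if dst == u and total + w == 0:
                    return True       # walk closed at u with zero weight
                nxt.add((dst, total + w))
        frontier = nxt
    return False


# Conflict graph of Figure 4 with the P edge (S1, S2) added on the left side.
fig4 = {
    ("S1l", "S2l"): 1,
    ("S2l", "S3r"): 0, ("S3r", "S2l"): 0,     # conflict on B
    ("S1l", "S4r"): 0, ("S4r", "S1l"): 0,     # conflict on A
    ("S2r", "S3r"): 0, ("S3r", "S2r"): 0,
    ("S1r", "S4r"): 0, ("S4r", "S1r"): 0,
    ("S1r", "S2r"): 1, ("S2r", "S1r"): 0,     # thread 1 body and back edge
    ("S3r", "S4r"): -1, ("S4r", "S3r"): 2,    # thread 2 body and back edge
}
needs_delay = on_zero_weight_walk(fig4, ("S1l", "S2l"), max_len=4)
```

For Figure 4 the walk S1l -> S2l -> S3r -> S4r -> S1l has weight 1 + 0 - 1 + 0 = 0, so the edge (S1, S2) is reported as lying on a zero cycle.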

As each invocation of the zero cycle detection algorithm takes $O(n^3)$ time, 
this algorithm unfortunately has a running time of $O(n^5)$. The loss in efficiency 
is compensated, however, by obtaining a much more accurate delay set. Figure 5 
demonstrates the analysis's benefit: while plain SPMD cycle detection 
(Algorithm 1) will incorrectly include every P edge in the delay set due to spurious 
cycles created by the loop back edge, Algorithm 3 can accurately eliminate these 
unnecessary delays. Another benefit of this algorithm is that it can be easily 
extended to support multidimensional arrays. For a $k$-dimensional iteration space, 
we simply construct CG using $k$-dimensional vectors as its edge weights, with 
each element corresponding to a loop index variable. As the level of loop nests 
in real programs rarely exceeds 4 or 5, this more complex scenario can still be 
solved in the same asymptotic time as the scalar-weight case. 

    for (i = 0; i < N; i++)
      if (cond) {
        A[i] = 1;    /* S1 */
        B[i] = 2;    /* S2 */
      }
    // each thread runs the same code

    [Graph: conflict graph with P edges and C edges; the two sides are labelled T1 and T2.]

Fig. 5. SPMD Code for which Algorithm 3 Is More Accurate Than Algorithm 1 



Data-Flow Analysis Approximation: The major limitation of Algorithm 3 
is that edge weights in general may not be compile-time constants; for example, it 
is common in scientific code to have a loop performing strided array accesses with 
a stride value that is either dynamic or a run-time constant. The signs of the 
weights, however, are usually statically known, and using abstract interpretation 
techniques [20] we can deduce the sign of a cycle's weight sum. If every edge of the 
cycle has the same sign, it can never be a zero cycle; otherwise we conservatively 
assume that it may satisfy the conditions in Theorem 5. Algorithm 4 generalizes 
this notion by applying data-flow analysis with the lattice and flow equations 
in Figure 6 to estimate the weight sum of each potential cycle. For each P edge 
$(u, v)$, $sgn(w)$ represents the possible sign of any path from $u$ to $w$; therefore, 
if $u_l$ is reachable from $v_l$ (indicating a back-path) and $sgn(u_l)$ is either 
+ or -, by definition $(u, v)$ will not be part of any zero cycle. 
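
The sign rule for a single cycle can be illustrated as below. This is a minimal sketch of the rule only, not of Algorithm 4's data-flow propagation; treating zero-weight edges (such as conflict edges) as neutral is our assumption, made because a sum of non-negative weights with at least one positive term cannot be zero. The function name is hypothetical.

```python
def may_be_zero_cycle(signs):
    """Conservative test on the edge signs of one cycle.

    signs holds one of "+", "-", "0" per edge. The cycle is ruled out
    only when its nonzero edges all share a single sign, since the
    weight sum is then strictly positive or strictly negative.
    """
    nonzero = {s for s in signs if s != "0"}
    return len(nonzero) != 1


# Figure 5: P edge (0), conflict (0), loop back edge (+), conflict (0).
fig5_cycle = may_be_zero_cycle(["0", "0", "+", "0"])
# Figure 4: P edge (+), conflict (0), P edge (-), conflict (0).
fig4_cycle = may_be_zero_cycle(["+", "0", "-", "0"])
```

The Figure 5 cycle is ruled out because its only nonzero edge is positive, while the Figure 4 cycle mixes signs and must conservatively be kept as a potential zero cycle.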

This approach is a sound but conservative approximation of the zero cycle 
detection problem, and thus may compute some false positive delays. While it 
gives the same result as Algorithm 3 for Figures 4 (delays) and 5 (no delays), 
the more complicated example in Figure 3 illustrates their differences. Although 
the zero cycle detection algorithm correctly concludes that sequential consistency 
could never be violated there due to the absence of zero cycles, Algorithm 4, 
affected by the negative edge from S3 to S4, will conservatively place every P 
edge in the delay set. For the common cases of loops with monotonic array 
subscripts, however, this analysis is as accurate as the one in the previous section. 

    [Lattice diagram over the sign values: top element $\top$, bottom element 0.]

    $\mathrm{OUT}(B) = \mathrm{IN}(B)$
    $\mathrm{IN}(B) = \bigcup \, (\mathrm{Sgn}(P, B) \cup \mathrm{OUT}(P))$, where P is a predecessor of B.

Fig. 6. Lattice and Flow Equations for Algorithm 4 

Since the lattice has a height of two, the data-flow analysis step will finish in 
at most $O(n^2)$ time. As the analysis step needs to be done for each P edge, it 
appears that we have an $O(n^4)$ algorithm. The insight here, however, is that when 
initializing the data-flow analysis for an edge $(u, v)$, the initial value of 
$sgn(v_l)$ can be only one of three different values; it thus suffices to run the 
data-flow analysis three times for each node in the graph to cover all possible 
initial conditions of the analysis. So Algorithm 4 has a worst-case $O(n^3)$ time 
bound. Extending this approach to support nested loops is straightforward; we can 
run the analysis separately for each dimension, and add an edge to the delay set 
only when all dimensions return a sign of either 0 or $\top$. 

Integer Programming Based Method: In the most general case, array sub- 
scripts will be affine expressions with arbitrary constant coefficients and symbolic 
terms, so the previous methods are no longer applicable as neither the value nor 



Input: P and C of an SPMD program, with weighted edges 
Output: delay set for P 

Construct CG following the description in Definition 5; 
foreach $(u, v) \in P$ do 
    Initialize $sgn(v_l)$ to be the sign of $W(u, v)$ (one of 0, +, -), and 
    $sgn(w)$ to be 0 for all other nodes $w$; 
    Apply data-flow analysis starting from $v_l$ until no nodes have their signs 
    changed; 
    If $u_l$ is reachable from $v_l$ and $sgn(u_l) \in \{0, \top\}$, then add 
    $(u, v)$ to the delay set. 
end 

Algorithm 4: Handling Array Accesses Using Data-flow Analysis 



